Objective

The objective was to determine appropriate methods for linking parameters of test items under a
variety of testing conditions.

Background 

Computerized adaptive testing (CAT) is a form of test administration that the Armed Services may
soon implement. It requires that large numbers of items be calibrated and stored in item banks from
which specific items are drawn adaptively by the computers for each testee. Because the number of items
to be calibrated is so large, it is not feasible to administer all of them to a single group, and so the items
must be calibrated in separate sets and then linked together onto a common scale. Four different methods
of linking the item sets were devised and evaluated.

Approach 

In an evaluation of the adequacy of various linking methods, the true item parameters must be
known. These were obtained through a computer simulation study with a design based on a practical
testing environment.

Specifics 

Method. A simulation study was designed in which simulated test items were defined to be similar in
terms of their item parameters to Armed Services test items, and populations of simulated examinees were
defined to be similar in ability to those individuals likely to take Armed Services tests.

Four linking methods were evaluated. The equivalent-groups method linked items by assuming
examinee groups to be equivalent. The equivalent-tests method assumed tests to contain equivalent items.
The anchor-group method linked through a common group of examinees. The anchor-test method linked
through a common set of items. These methods were compared to each other and to a condition in which
no explicit linking was done.

Three linking conditions were simulated. One was the condition in which test booklets were
randomly distributed among the entire population. Another was the condition in which test booklets were
distributed systematically among relatively few testing centers. The final condition was one in which a
population of examinees selected on the basis of their scores was used.

Three categories of evaluative criteria were used. Fidelity-of-parameter-estimation criteria examined
the relations between true and estimated item parameters. Asymptotic-ability-estimate criteria examined
the relations between the true and asymptotic (i.e., infinite-test-length) ability estimates.
Efficiency-of-ability-estimation criteria included average item information and relative efficiency.

Results and discussion. Despite its simplicity, the equivalent-groups method worked well under
most testing conditions. The anchor-group and anchor-test methods were slightly superior when the
assumption of equivalent groups was violated. The equivalent-tests method [illegible]
than the other three methods. Modal-Bayesian scoring of tests generally produced better linking results
than did maximum-likelihood scoring.

Conclusions

Two procedures can be recommended for linking. Linking during development of the initial item
pool can most efficiently be accomplished using the equivalent-groups method with examinees randomly
selected from the general calibration population. Items added to the pool at a later date should be linked
using the anchor-test method.
I. INTRODUCTION



During the past decade, an extensive investigation of adaptive
testing has been conducted. In its simplest form, adaptive testing
amounts to administering the subset of items, selected from a larger 
pool, that provides the most information about the individual re- 
garding the characteristic the test measures. A summary of the cur- 
rent state of the art, extracted from the 1979 Computerized Adaptive 
Testing Conference (Weiss, 1980), is that adaptive testing potentially 
offers several advantages over conventional testing methods, but to 
realize these advantages, characteristics of the items comprising the 
pool must be accurately determined.

Most adaptive testing technology is built on the framework of 
Item Response Theory (IRT), also called Latent Trait Theory or Item 
Characteristic Curve (ICC) Theory. In IRT, test items are described 
by a set of item parameters. It is these parameters that must be 
accurately determined if adaptive testing is to be effective. This 
determination is called item calibration. Because adaptive testing
requires a large item pool, and because item calibration requires
administration to a large number of examinees, calibration must often be
accomplished in parts such that different groups of individuals take 
different sets of items. 

The purposes of the project were to determine efficient methods 
of partitioning the calibration examinee samples and item sets, and 
to determine efficient methods of re-assembling or linking the parts 
into a common whole once the individual calibrations are accomplished. 
As background to the research, the first section of this report re- 
views some of the concepts basic to calibration and linking. Pre- 
vious research, its shortcomings and unanswered questions, will be 
reviewed and discussed. In subsequent sections, a research design to 
eliminate these shortcomings will be described and research conducted 
according to that design will be reported. 



Overview of Item Response Theory 

Item Response Theory has been called the psychometric equiva- 
lent of Einstein's Theory of Relativity (Warm, 1973). Stated simply, 
IRT specifies a general mathematical relationship between an indi- 
vidual's status on an underlying trait, characteristics of a test 
item, and the probabilities regarding how the individual will respond 
to the item. The term IRT actually refers to a general class of 
psychometric models. Included in the class are models for use when
the response is dichotomous (Lord & Novick, 1968; Birnbaum, 1968),
models for use when the response is polychotomous (Samejima, 1969,
1972; Bock, 1972), and models for use when the response is continuous 
(Samejima, 1974). These models have typically been developed for use 
where a unidimensional trait is measured. Extension of each to 
multidimensional traits would double the number of available models.
Hambleton & Cook (1977) present an overview of most of the
unidimensional IRT models.

All the item domains considered by the current research contained
dichotomous ability items of a multiple-choice nature. Two IRT
models are appropriate for such items: the three-parameter normal and
logistic ogive models. For reasons of mathematical tractability, the
logistic model is generally preferred over the normal model and will 
be of primary focus throughout this report. A single-parameter degen- 
erate case of the three-parameter logistic model, the Rasch model, 
will be included in some parts of this review because of its similar- 
ity to the three-parameter logistic model and because more research 
has been done on calibration and linking using the Rasch model than 
has been done using the three-parameter logistic model. 

In the three-parameter logistic model, the item is characterized 
by the three parameters a, b, and c. Ability is characterized by a 
single parameter, theta. The a parameter is an index of the item's 
power to discriminate among different levels of ability. It ranges, 
theoretically, between negative and positive infinity but practically 
between zero and about three when ability is expressed in a standard- 
score metric. A negative a parameter would mean that a low-ability 
examinee had a better chance of answering the item correctly than did
a high-ability examinee. An a parameter of zero would mean that the 
item had no capacity to discriminate between different levels of 
ability (and would therefore be useless as an item in a power test). 
Items with high positive a parameters provide sharper discrimination 
among levels of ability and are generally more desirable than items
with low a parameters. 

The b parameter indicates the difficulty level of an item. It 
is scaled in the same metric as ability and indicates the value of 
theta an examinee would need in order to have a 50-50 chance of know- 
ing the correct answer to the item. This is not, however, the level 
of theta at which the examinee has a 50-50 chance of selecting a cor- 
rect answer if it is possible to answer the item correctly by guessing. 

The c parameter gives the probability with which a very low- 
ability examinee would answer the item correctly. It is often called 
the guessing parameter as it is roughly the probability of answering 
the item correctly if the examinee does not know the answer and guesses
at random. Intuitively, the c parameter of an item should be the
reciprocal of the number of alternatives in the item. Empirically, 
it is typically somewhat lower than this. 

All four parameters enter into the three-parameter logistic test 
model to determine the probability of a correct response. The formal 
mathematical relationship is given by Equation 1: 
$$P(u = 1 \mid \theta) = c + (1 - c)\,\Psi[1.7a(\theta - b)] \qquad [1]$$

where:

$$\Psi(x) = [1 + \exp(-x)]^{-1}$$



In Equation 1, u = 1 if the response to the item is correct and u = 0
if the response is incorrect. The relationship expressed in Equation 1
is shown graphically in Figure 1. The item characteristic curve
drawn with a solid line is for an item with a = 1.0, b = 0.0, and c =
.2. The slope at any point is related to a. The lower asymptote
corresponds to a probability, or c, of .2. The item characteristic
curve shown with a dashed line is for an item with a = 2.0, b = 1.0,
and c = .2. The midpoint of the curve has shifted to theta = 1.0. The
slope of the curve is steeper near theta = b. The lower asymptote of the
curve remains, however, at .2.
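
As a minimal sketch, Equation 1 can be evaluated directly for the two items described for Figure 1. The function and variable names are illustrative assumptions rather than anything defined in this report, and the value b = 0.0 for the solid curve is taken from the reading of the text above.

```python
import math

D = 1.7  # scaling constant that appears in Equation 1


def logistic(x):
    """Psi(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def p_correct(theta, a, b, c):
    """Probability of a correct response under the three-parameter logistic model."""
    return c + (1.0 - c) * logistic(D * a * (theta - b))


# The two items described for Figure 1 (solid and dashed curves).
solid = dict(a=1.0, b=0.0, c=0.2)
dashed = dict(a=2.0, b=1.0, c=0.2)

# Tabulate both curves at a few ability levels.
for theta in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(theta,
          round(p_correct(theta, **solid), 3),
          round(p_correct(theta, **dashed), 3))
```

At theta = b the logistic term equals one half, so the probability of a correct response is c + (1 - c)/2 rather than .5. This matches the earlier remark that b marks a 50-50 chance of knowing the answer, not of selecting it.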

Ultimately, theta is the only parameter that needs to be esti- 
mated; the objective of testing is to estimate an individual's abil- 
ity level. To accomplish this, however, it is necessary to first 
know the item parameters. The items must therefore be calibrated. 



Figure 1. Item Characteristic Curves 




Although Ree (1979) has shown that, under certain conditions, ability 
estimation can proceed very well with quite poor estimates of item
parameters, in the general case, good estimation of ability requires 
good estimation of item parameters. 



Item Calibration 

Estimation Techniques 

Two methods of estimating item parameters have been primarily 
employed in IRT applications: maximum-likelihood estimation and 
minimum chi-square estimation. The former method identifies the 
parameter values for which the probability of observing the observed 
data (i.e., the likelihood) is a maximum. The latter method identi- 
fies the parameter values for which the discrepancy between the model 
and the observed data is a minimum. Both methods are discussed in 
detail below with general reference to three-parameter models. 

Maximum-likelihood estimation. Conceptually, the application of
maximum-likelihood techniques to estimation of item parameters is 
simple. The probability of observing a response vector is expressed 
in terms of the unknown parameters, and the parameter values making 
this probability a maximum are the maximum-likelihood parameter esti- 
mates. In practical calibration applications, however, the number 
of parameters to be estimated may exceed several thousand and the 
numerical difficulties make the simple conceptual task practically 
formidable. 
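
The idea of expressing the probability of a response vector in terms of unknown parameters can be sketched in its simplest form. The sketch below is not an item calibration: it treats the item parameters as known and searches a grid for the theta that maximizes the likelihood. It assumes that responses are independent given theta; the item values, grid limits, and function names are hypothetical.

```python
import math

D = 1.7  # scaling constant from Equation 1


def p_correct(theta, a, b, c):
    """Three-parameter logistic probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def log_likelihood(responses, theta, items):
    """Log-likelihood of one 0/1 response vector (independence given theta assumed)."""
    total = 0.0
    for u, (a, b, c) in zip(responses, items):
        p = p_correct(theta, a, b, c)
        total += math.log(p) if u == 1 else math.log(1.0 - p)
    return total


def ml_theta(responses, items, grid=None):
    """Grid-search maximum-likelihood estimate of theta, item parameters held fixed."""
    if grid is None:
        grid = [i / 100.0 for i in range(-400, 401)]  # -4.00 to 4.00
    return max(grid, key=lambda t: log_likelihood(responses, t, items))


# Hypothetical (a, b, c) triples for three items.
items = [(1.0, -1.0, 0.2), (1.0, 0.0, 0.2), (1.0, 1.0, 0.2)]
print(ml_theta([1, 1, 0], items))
```

An all-correct or all-incorrect vector has no finite maximum, so the grid search simply returns the grid boundary in those cases. Calibration faces the same kind of search over several thousand parameters at once, which is the numerical difficulty noted above.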

Two approaches to maximum-likelihood item calibration are the
unconditional and the conditional approaches¹ (Bock, 1972; Bock &
Lieberman, 1970). In the unconditional approach, a distribution
of theta is assumed and the theta parameter in each individual 
response vector is integrated out. This results in a set of like- 
lihood functions, one function for each examinee, that is independ- 
ent of theta. From these functions, the item parameters can be 
estimated. There are two difficulties with use of the unconditional 
approach. First, it requires an assumption as to the form of the 
distribution of theta and, second, due to the integration required, 



1. The terms "unconditional" and "conditional" as used here should 
not be confused with the identical terms used in the Rasch literature 
(e.g., Anderson, 1971, 1977; Gustafsson, 1979; Reckase, 1977).
"Unconditional" in the Rasch literature refers to the "conditional" case
discussed here. "Conditional" in the Rasch literature refers to the 
use of likelihood functions conditioned on the sufficient number cor- 
rect statistic and is, in some ways, analogous to the "unconditional" 
approach discussed here. 
it is computationally too burdensome for use with more than a few
items. 
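
In symbols, the unconditional likelihood of a single response vector can be written as below. The notation is an assumption introduced here for illustration: P_i(theta) is the probability from Equation 1 for item i, n is the number of items, and g(theta) is the assumed ability density.

```latex
L(\mathbf{u} \mid \mathbf{a}, \mathbf{b}, \mathbf{c})
  = \int_{-\infty}^{\infty}
    \prod_{i=1}^{n} P_i(\theta)^{u_i}\,\bigl[1 - P_i(\theta)\bigr]^{1 - u_i}\;
    g(\theta)\, d\theta
```

The integral has to be evaluated for every distinct response vector at every step of the search, which is the source of the computational burden.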



The conditional approach assumes the theta values are unknown
but fixed parameters to be estimated in the same manner as the item
parameters. The computer program LOGIST (Wood, Wingersky, & Lord,
1976) is the major operationalization of the conditional approach
to calibration. Although, in theory, both theta parameters and item
parameters can be estimated simultaneously, LOGIST iterates between 
estimation of theta and estimation of item parameters. Provisional 
values of theta are obtained from each examinee's raw score and these 
are used as true theta values while the item parameters are estimat- 
ed. The estimated item parameters are then used to re-estimate the 
theta parameters and the procedure iterates until stable item and 
theta parameter estimates are found. Convergence can require a large 
amount of computation. 

Minimum chi-square estimation. Regardless of how the parameters
of the model are estimated, the adequacy with which the model fits
the observed data can be tested with a Pearson chi-square test.
This is accomplished by grouping subjects on the basis of ability (or
estimated ability), predicting for each item the proportion of
subjects in each subgroup who should answer it correctly according to
the model, and testing the significance of the discrepancy between 
observed and predicted proportions using a chi-square test. The 
minimum chi-square approach to estimation explicitly selects param- 
eter values to minimize this chi-square value. Except for the 
change in criterion, however, the approach is similar to the
conditional maximum-likelihood approach.
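
The grouping and comparison just described can be sketched for a single item as follows. This illustrates the general idea only and is not the procedure implemented in any particular program. The number of ability groups, the simulated data, the coarse grid of trial b values, and all names are assumptions, and examinees are grouped here on their true simulated ability.

```python
import math
import random

D = 1.7  # scaling constant of the logistic model


def p_correct(theta, a, b, c):
    """Three-parameter logistic probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def item_chi_square(thetas, responses, a, b, c, n_groups=5):
    """Pearson chi-square for one item, with examinees grouped by ability."""
    if len(thetas) < n_groups:
        raise ValueError("need at least one examinee per group")
    order = sorted(range(len(thetas)), key=lambda i: thetas[i])
    size = len(order) // n_groups
    chi2 = 0.0
    for g in range(n_groups):
        start = g * size
        stop = len(order) if g == n_groups - 1 else start + size
        members = order[start:stop]
        n = len(members)
        observed = sum(responses[i] for i in members) / n
        expected = sum(p_correct(thetas[i], a, b, c) for i in members) / n
        # Correct and incorrect cells combined into one term per group.
        chi2 += n * (observed - expected) ** 2 / (expected * (1.0 - expected))
    return chi2


# Hypothetical data: 2,000 simulated examinees answering one item.
rng = random.Random(0)
true_a, true_b, true_c = 1.0, 0.0, 0.2
thetas = [rng.gauss(0.0, 1.0) for _ in range(2000)]
responses = [1 if rng.random() < p_correct(t, true_a, true_b, true_c) else 0
             for t in thetas]

# Minimum chi-square over a coarse grid of trial b values (a and c held fixed).
trial_bs = [-2.0, -1.0, 0.0, 1.0, 2.0]
best_b = min(trial_bs,
             key=lambda b: item_chi_square(thetas, responses, true_a, b, true_c))
print(best_b)
```

A full minimum chi-square calibration would search over a, b, and c jointly and would group on estimated rather than true ability.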

A major proponent of this approach was Urry (1978), who sponsored 
several computer programs to perform such estimation; the most
frequently used are OGIVIA and ANCILLES. In these programs, examinees
are scored based on provisional parameter estimates. Several trial 
values of the c parameters are chosen and a and b parameters are esti- 
mated using equations given by Urry (1976). The combination of a, b, 
and c that produces the minimum lack of fit with the IRT item charac- 
teristic curve, as indicated by a chi-square statistic, is chosen as 
the minimum chi-square parameter estimate. 

Criteria of Good Estimation

Texts in statistics (e.g., Lindgren, 1976) typically list four
desirable characteristics of an estimator of a parameter: an esti- 
mator should be unbiased, efficient, sufficient, and consistent. An 
unbiased estimator has an expected value equal to the parameter it 
estimates. An efficient estimator has, in comparison to other esti- 
mators, small mean squared-deviation from the parameter. If the 
estimator is unbiased, its variance is an index of its efficiency. 
A sufficient estimator contains all the information regarding the 
parameter that is available from the data on which it is calculated.
Information of an unbiased estimator is an estimate of the reciprocal
of the squared error of estimate of the parameter (see Lindgren,
1976, for a discussion of information). An unbiased sufficient
estimator is efficient in an absolute sense as no other estimator can
be more efficient. Finally, a consistent estimator is one that
converges on the parameter values as the data on which it is based
increase. Increased data, in psychometric applications, refers to both
increased subject sample size and increased item set size (i.e., more
items). Both must approach infinity for item and ability parameter
estimates to converge on their true values, but acceptable estimates 
can be obtained from sample sizes that are obtainable in practice. 

Evaluation of the quality of estimators in terms of these cri- 
teria can be done analytically in simple applications. In evalua- 
tion of item calibration techniques, analytic calculation of these 
criteria is practically impossible because of the complexity of the
calculations. Hence, they must be evaluated through simulation
techniques. In such a simulation, responses to items with known 
parameters are generated according to a statistical model (see Vale 
& Weiss, 1975, or Ree, 1973, for a full description of a simulation). 
Parameters are then estimated from the item responses as if these 
responses had been generated by real examinees, and the estimated 
parameters are compared to the true values. In studies done com- 
paring estimated with true item parameters, three indices of com- 
parison have typically been calculated for individual item param- 
eters. The average algebraic difference between true and estimated 
parameters has been calculated as an index of bias. The mean-square 
deviation of estimated parameters from the true parameters has been 
calculated and can be considered an index of efficiency. The corre- 
lation between true and estimated parameter values has been calculated 
and, if the estimates are linear estimates of the parameters, this can 
be thought of as an index of relative sufficiency when comparing two 
methods on the same items and subjects. All these indices are typi- 
cally calculated at several combinations of test length and sample 
size and thus provide some evidence for consistency. 
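
The three indices of comparison can be computed as in the sketch below. The sign convention for bias (estimate minus true value) and the function name are assumptions; the text does not fix the direction of the algebraic difference.

```python
import math


def comparison_indices(true_values, estimates):
    """Return (bias, mean-square deviation, correlation) for paired parameter values."""
    n = len(true_values)
    diffs = [e - t for t, e in zip(true_values, estimates)]
    bias = sum(diffs) / n                     # average algebraic difference
    msd = sum(d * d for d in diffs) / n       # mean-square deviation
    mean_t = sum(true_values) / n
    mean_e = sum(estimates) / n
    cov = sum((t - mean_t) * (e - mean_e)
              for t, e in zip(true_values, estimates)) / n
    var_t = sum((t - mean_t) ** 2 for t in true_values) / n
    var_e = sum((e - mean_e) ** 2 for e in estimates) / n
    corr = cov / math.sqrt(var_t * var_e)     # product-moment correlation
    return bias, msd, corr


# Hypothetical true and estimated b parameters for three items.
bias, msd, corr = comparison_indices([1.0, 2.0, 3.0], [1.5, 2.5, 3.5])
print(bias, msd, corr)
```

In this example every estimate is too high by the same amount, so the bias and mean-square deviation are nonzero while the correlation is essentially perfect. The correlation alone therefore cannot detect a constant shift in the estimates.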

In addition to evaluation of the parameters separately, some 
researchers (e.g., Ree, 1978) have attempted to evaluate the parameters
collectively by comparing the test scores produced by the estimated
parameters with those produced by the true parameters. There
may be some tendency for errors in one parameter to cancel out or
compensate for errors in other parameters. Separate evaluation would not
show this effect; joint evaluation would. As will be discussed in
regard to the study by Ree, this evaluation may be done in several ways.

Evaluation of Estimation Techniques

Lord (1975) evaluated the LOGIST procedure in a simulation study. 
For this study, item parameters for 90 verbal items of the Scholastic 
Aptitude Test were estimated by LOGIST using a sample of 2,995
examinees. These parameters, after correction for errors of estimate,
were used as the basis for a Monte-Carlo simulation in which 2,995 
hypothetical examinees (with abilities similar to those of real exam- 
inees) "responded" to the items according to the logistic test model. 
These responses were then used by LOGIST to re-estimate the item param- 
eters. The parameters entering the simulation model were taken to be 
true parameters, and the effectiveness of LOGIST was evaluated by how 
accurately these true parameters were estimated. Root-mean-square 
errors of estimation and the correlations between true and estimated 
parameters were, respectively, .130 and .920 for the a parameters and 
.196 and .983 for the b parameters. For the c parameters, the root- 
mean-square error was .070; the correlation between the true and esti- 
mated c parameters was not reported. 
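
The response-generation step of a Monte-Carlo simulation of this kind can be sketched as follows. Only the sample size of 2,995 comes from the study described above. The three items, the standard normal ability distribution, the seed, and all names are hypothetical stand-ins, and the re-estimation step is not shown.

```python
import math
import random

D = 1.7  # scaling constant of the logistic model


def p_correct(theta, a, b, c):
    """Three-parameter logistic probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def simulate_responses(thetas, items, rng):
    """Return one 0/1 response vector per simulated examinee."""
    data = []
    for theta in thetas:
        row = [1 if rng.random() < p_correct(theta, a, b, c) else 0
               for (a, b, c) in items]
        data.append(row)
    return data


rng = random.Random(1980)  # arbitrary seed, for reproducibility only
items = [(1.0, -1.0, 0.2), (1.5, 0.0, 0.2), (0.8, 1.0, 0.25)]  # hypothetical (a, b, c)
thetas = [rng.gauss(0.0, 1.0) for _ in range(2995)]            # assumed N(0, 1) abilities
responses = simulate_responses(thetas, items, rng)

# Proportion correct per item, as a quick check on the generated data.
proportions = [sum(row[j] for row in responses) / len(responses)
               for j in range(len(items))]
print(proportions)
```

Because the item parameters entering the simulation are known, any calibration program run on `responses` can be scored against them.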

Gugel, Schmidt, and Urry (1976) reported a similar simulation 
study of the minimum chi-square procedure. Some major differences 
between this study and that of Lord (in addition to the different
estimation procedure) were that (a) the hypothetical subjects were
drawn from a standard normal ability distribution rather than matched
to subjects having taken an existing test, (b) the hypothetical item
parameters were rectangularly distributed in ranges typical for such
parameters rather than matched to those from an existing test, and
(c) subject sample sizes and item set sizes were systematically
varied. Of the conditions investigated, a condition with 90 items and
2,000 subjects was most comparable to Lord's study of LOGIST. In this
condition, root-mean-square errors and correlations were, respectively,
.2[illegible] and .871 for the a parameter, .119 and .996 for the b
parameter, and .069 and .568 for the c parameter. Direct comparisons with
Lord's study are not particularly meaningful, however, because the
distributions of all parameters were different and this can drastically
affect the comparative indices. The study did note, however, that
the minimum chi-square procedure did not work well when the numbers of
subjects used fell as low as 500.

Schmidt and Gugel (1976) again reported the preceding study, as 
well as a second study in which the number of items used was 100 and 
the sample sizes were 2,000 and 3,000. Root-mean-square errors for 
the final estimates at sample sizes of 2,000 and 3,000, respectively, 
were .212 and .228 for the a parameter, .123 and .118 for the b
parameter, and .056 in both samples for the c parameter. Correlations
were .915 and .918 for the a parameter, .996 and .997 for the b param- 
eter, and .761 and .760 for the c parameter. Little change was appar- 
ent between sample sizes of 2,000 and 3,000. The results of these two 
studies led Schmidt and Gugel to conclude that, as a rule-of-thumb, 
item sets should contain at least 100 items and should be administered 
to at least 2,000 subjects to obtain an accurate calibration. 

Two studies comparing different calibration techniques have been 
done, to date. Ree (1978, 1979) compared four calibration techniques 
in three different populations. The four calibration techniques
were: (a) ANCILLES, minimum chi-square estimation with ancillary
correction for errors in estimation of ability, (b) OGIVIA, minimum
chi-square estimation similar to that of ANCILLES, (c) LOGIST, the
conditional maximum-likelihood approach, and (d) transformation of
classical parameters derived from IRT assuming a normal distribution of
ability (see Jensema, 1976, for a description of the transformations).
The three ability distributions were: (a) a rectangular
distribution of ability bounded at theta = ±2.5, (b) a normal (0,1)
distribution of ability with elimination of the lower third on the basis of a
number-correct score, and (c) a normal (0,1) distribution of ability.
The hypothetical items used in the simulation had parameters
distributed normally in ranges typically found in real item sets. Among
the criteria investigated were: (a) correlations between true and
estimated item parameters, (b) correlations between ability estimates
computed using both true and estimated item parameters, (c) correlations
between true number-correct scores generated using both true
and estimated item parameters, and (d) test information curves
resulting from the true and estimated item parameters. All analyses
were performed on samples of 2,000 examinees and tests 80 items in
length.

Evaluated on the criterion of correlation between estimated and 
true item parameters, LOGIST generally produced the highest correla- 
tions. The exception to this was in the normal ability distribution 
in which OGIVIA produced slightly better estimates of a and b. The 
best estimates of the item parameters were obtained using LOGIST and 
a rectangular distribution of ability. 

Correlations between true and estimated ability levels showed 
LOGIST to be slightly better than ANCILLES and OGIVIA, and the trans- 
formations to be slightly worse. Differences among correlations were 
small, however, ranging from .955 to .97[illegible] in the rectangular distribution,
from .930 to .9[illegible]3 in the truncated normal distribution, and
from .961 to .965 in the normal distribution. 

Correlations between true scores obtained using true and esti- 
mated parameters showed very little difference among methods and 
only a small deviation from unity. The largest difference observed 
was in the rectangular distribution where the transformation yielded 
a correlation of .9910 and LOGIST yielded one of .9960. All other 
distributions produced correlations of .999, with variations in the 
fourth decimal place. 

When compared in terms of the information curves produced by the
item parameter estimates, all methods except the transformations
produced information curves similar to the true information curve in the
rectangular and normal ability distributions. In both of these
distributions, LOGIST produced information curves somewhat closer to the
true curve than did ANCILLES or OGIVIA. In the selected distribution,
all methods produced noticeable departures from the true information
curve.

Of the four criteria investigated, only the correlations among
item parameters and the information curves are independent of the
ability distribution; thus, these criteria are the only ones that
can be compared across ability distributions. (Equivalent estimation
accuracy would yield differences in the other criteria solely as a
function of the ability distribution.) On these two criteria, LOGIST
was nearly always superior to the other methods. The degree of
superiority was not overwhelming, however, and an analysis of cost
suggested that other methods were to be favored. The second-best
procedure, in terms of psychometric criteria, was OGIVIA. OGIVIA
required less than one-tenth as much computer time to use as did
LOGIST.

As a final point, the level of correlation between actual and
estimated ability levels and actual and estimated true scores is
noteworthy. Especially with the true scores, the level of correlation
was so high as to suggest that one might do well enough without
bothering to estimate parameters at all. In fact, Ree (1979)
has shown that the correlation between the estimated and true values
of any one of the three IRT parameters can be degraded to little
relation with its true value and still yield correlations between
actual and estimated true scores of .93 and above. All these results,
however, were obtained using conventional tests where all
examinees answer the same items. When administration is adaptive
and each examinee answers a different set of items, these correlations
could be expected to drop substantially as a result of poor
item calibration. Unfortunately, no study has investigated this
effect directly. Schmidt and Gugel (1976), in the study discussed
earlier, provided data that hinted at the answer. When the size of
the calibration sample fell to 1,000 examinees and the length of the
calibration item set fell to 60, there was a noticeable decrease in
the quality of tests administered using a Bayesian strategy when
compared to similar tests given using true item parameter values.
Thus, although definitive data do not exist, those data which do exist
suggest that the extremely high correlations between estimates of true
scores obtained using the different parameter estimates may be due to
an averaging-out phenomenon peculiar to conventionally administered
tests.

The second study comparing various calibration procedures was
done by Swaminathan and Gifford (1980). Noting that the Ree study
investigated only a single test length and sample size, they compared
ANCILLES and LOGIST in simulation at test lengths of 10, 15,
20, and 80 items and sample sizes of 50, 200, and 1,000. Items had
true a parameters distributed rectangularly between .6 and 2.0, true
b parameters distributed rectangularly between -2.0 and 2.0, and true
c parameters fixed at .25. Three distributions of ability were used;
one was normally distributed with a mean of zero and variance of
one, the second was rectangularly distributed between -1.73 and 1.73,
and the third was a standardized negatively skewed beta distribution.
Criteria of calibration effectiveness included the differences between
means of true and estimated a, b, and c parameters, the correlations
between true and estimated a and b parameters, the differences in
means of ability estimates using true and estimated parameters, and
the correlations between these values.

The b parameter estimates correlated highly with their true 
values in all conditions using either of the calibration methods.
Medians for each of the distributions were all above .9. A trend 
toward higher correlations with increased test length was observed, 
and median correlations for LOGIST were slightly higher than those 
for ANCILLES. No substantial differences were observed among
distributions.

The a parameters were less well estimated. Median correlations 
were near .4 for the normal and rectangular ability distributions,
but dropped to near .[illegible] in the skewed distribution. Improvements in
estimation occurred both with increasing test length and sample 
size, however. Median correlations using LOGIST were consistently 
higher than those of ANCILLES. 

Correlations could not be computed for the c parameters since 
the true values were fixed at .25. 

Correlations between ability estimates and true abilities were 
nearly equivalent for the two procedures. Increases were noted with 
increasing calibration test length but increases in sample size made 
trivial differences. 

The mean-difference criteria suggested that both item parameters
and ability estimates were biased somewhat. In general, ANCILLES
produced more bias than LOGIST. Bias decreased with increasing
test lengths and sample size.

Swaminathan and Gifford concluded that, although LOGIST produced 
slightly better estimation than did ANCILLES, it cost considerably 
more to run and the gain was probably not worth the cost. They further
concluded that a and c parameters should not be estimated using
tests containing 15 or fewer items.



Item Linking 

Predicting, Equating, and Linking: A Clarification of Concepts

Scores from one test are often used to infer scores on a second 
test. Whether this inference is an act of predicting, equating, or 
linking will depend on the tests involved and the method used in
making the inference. 



Equating and predicting. Methods for equating test scores among
different groups of people have long been available. Publishers of
entrance examinations for educational institutions, faced with the
need to change the examinations each time they were administered and 
aware that different types of people took the examinations in April 
and October, developed the means of assuring that a person of fixed 
ability would attain approximately the same score regardless of when 
the examination was administered. Formally, equating methods are pro- 
cedures for expressing scores from two different tests measuring the 
same trait on a common score metric. The crucial requirement is that 
the tests measure the same trait. 

Methods for predicting one test score from another have also 
long been available. The reason for giving entrance examinations in 
the first place was based on the empirical fact that scores on the 
entrance examinations predicted, to some degree, scores on classroom 
examinations. The difference between equating and prediction is that 
two tests do not have to measure the same trait to be candidates for
prediction. 

Statistical methods for equating and predicting come in both 
linear and non-linear forms. In the linear case, prediction is accom- 
plished by linear regression. Equating is accomplished by a similar 
procedure in which a correlation of 1.0 between tests is assumed.
Prediction uses the empirical data to estimate the relationship between
the two traits. Equating assumes, not unreasonably, that a trait 
should correlate very highly (i.e., perfectly) with itself. The pre- 
diction equation is not invertible; a regression equation used for 
predicting test A from test B cannot simply be reversed and re-applied 
to predict test B from test A. The exception to this rule is when the 
correlation between tests is perfect. The assumption of perfect
correlation made in equating allows the equating equation to be used
for the inverse transformation.

If equating procedures are used for a prediction problem, the
result will be less-than-optimal predictions. If regression is used
for an equating problem, the result will be a lack of correspondence
between test scores, and correspondence was the objective of equating
in the first place.
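
The contrast between linear prediction and linear equating can be
made concrete with a short sketch. The Python below is illustrative
only: the function names and the paired scores are hypothetical and
are not taken from any study reviewed here. Prediction uses a slope
of r times the ratio of standard deviations, whereas equating uses the
ratio alone (the regression slope under an assumed correlation of
1.0), which is why only the equating line survives a round trip.

```python
import statistics as st

def pearson_r(x, y):
    """Sample correlation between paired score lists x and y."""
    mx, my = st.mean(x), st.mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / (sxx * syy) ** 0.5

def linear_line(x, y, r):
    """Slope and intercept for converting x scores to the y metric, given r."""
    slope = r * st.stdev(y) / st.stdev(x)
    return slope, st.mean(y) - slope * st.mean(x)

def prediction_line(x, y):
    # Regression: the correlation is estimated from the data.
    return linear_line(x, y, pearson_r(x, y))

def equating_line(x, y):
    # Equating: the correlation is assumed to be 1.0.
    return linear_line(x, y, 1.0)

# Hypothetical paired scores on test B and test A.
test_b = [10, 12, 15, 18, 20, 25]
test_a = [31, 30, 38, 36, 45, 44]

def round_trip(line_fn, score):
    """Convert a test B score to the test A metric and back again."""
    s_ab, i_ab = line_fn(test_b, test_a)
    s_ba, i_ba = line_fn(test_a, test_b)
    return s_ba * (s_ab * score + i_ab) + i_ba

print(round_trip(equating_line, 25.0))    # recovers the starting score
print(round_trip(prediction_line, 25.0))  # pulled toward the test B mean
```

The round trip through the two regression lines shrinks the deviation
from the mean by a factor of r squared, so it returns the original
score only when the correlation is perfect.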

Linking. Linking is a term that describes the act of equating
at the item level. The objective in equating, as discussed above,
was to put total test scores onto a common metric. Linking is used
to describe the process of putting items from different tests on a
common metric. Linking was first investigated as a means to an end
of test equating (Fan, 1957; Swineford & Fan, 1957) and did not
generate a great deal of research interest. More recently, as a result
of adaptive testing applications, linking has become a legitimate end
in itself. Adaptive testing item pools, because of their size, have 
had to be constructed by linking smaller sets of items together on 
a common parameter metric. 

The objective of this project was to find efficient ways of link- 
ing test items. Much of the research available to date has been on 
equating rather than linking. There are close parallels between the 
two, however, and the following review will include equating as well
as linking efforts. Prediction is a vast subject and will not be 
covered except to point out instances in which it was used appropri- 
ately as a linking or equating method. 

Paradigms of Linking and Equating 

Linking and equating paradigms can be categorized on two basic 
aspects: the design by which data are collected and the method by 
which the linking transformation is determined. Angoff (1971), in a
classic survey of equating methodology, listed six major equating 
designs. In terms of data collection, these six designs can be 
grouped into two categories: designs assuming equivalent samples of 
examinees to achieve equation (Designs I and II) and designs employ- 
ing an anchor test to achieve equation (Designs III, IV, V, and VI). 
Transformations, in Angoff's designs, are determined either through
linear or curvilinear means. Marco (1977), in a recent survey,
listed three data collection designs: (a) all items are given to a 
single group of examinees, (b) the same set of items is administered 
to different groups of people, and (c) an anchor set of items is
common to all tests given to different groups of people. 

There are, in fact, four basic data collection designs of poten- 
tial utility for linking: (a) the equivalent-groups method, (b) the 
equivalent-tests method, (c) the anchor-group method, and (d) the 
anchor-test method. Angoff's first two designs are contained in the
equivalent-groups method, and his latter four are examples of the 
anchor-test method. Marco's three designs are, respectively, a 
special case of the equivalent-groups method, a special case of 
the equivalent-tests method, and the anchor-test method.

In theory, IRT explicitly makes the relationship among item 
parameters, across groups, linear. There is thus no need to discuss 
the curvilinear transformation procedures. Reckase (1979) presented 
the most exhaustive array of linear procedures yet encountered. As 
will be discussed, however, only the one called the major-axis
procedure is an appropriate linking transformation method. Transformation
methods thus do not offer much ground for research. 

In theory, IRT item parameters are invariant, except for a lin- 
ear transformation, across groups of individuals. The constants of 
the linear transformation necessary to change one metric to another 
(assuming a unidimensional pool of items) are simple functions of
the means and standard deviations of the abilities of the groups 
under consideration. When items are calibrated, there are four 
values that are undetermined and must be arbitrarily imposed: the a 
and b parameter means and the ability mean and standard deviation. 
Among this group of four values, there are two degrees of freedom 
corresponding to unit and origin of the metric to be chosen. The 
unit can be specified by fixing either the mean a parameter or the 
standard deviation of the ability distribution. When one is fixed,
the other is determined. The origin can be specified by fixing 
either the mean b parameter or the mean of the ability distribution. 
Again, when one is fixed, the other is determined. Any one of the
values can be varied at will as long as the corresponding value is 
also appropriately adjusted. 

As an example, assume that a set of items had been calibrated on 
a group of individuals and that the ability mean and standard devia- 
tion were set at zero and one, respectively. If desirable, the 
ability mean and standard deviation could be changed to 50 and 10. 
To do this, each ability estimate would be multiplied by 10 and 50 
would be added. Also, the a and b parameters would have to be adjusted
accordingly. In this case, the a parameters would have to be divided
by 10 and the b parameters transformed by multiplying them by 10 and
adding 50. The c parameter is evaluated at an infinitely low ability
level and is thus not affected by the transformation (i.e., any finite
linear transformation leaves negative infinity untouched).
A linear transformation such as this could be used to set the mean
and standard deviation of the ability distribution or the mean a and b
values to any value without affecting the performance of the ICC model 
as long as both parameters were adjusted in the two pairs. 
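
The invariance described in this example can be checked numerically.
The sketch below assumes the usual three-parameter logistic ICC with
a scaling constant D of 1.7; the function names and the item and
ability values are hypothetical. Ability is multiplied by 10 and
shifted by 50, the a parameter is divided by 10, the b parameter is
multiplied by 10 and shifted by 50, and c is left alone.

```python
import math

def icc_3pl(theta, a, b, c, D=1.7):
    """Three-parameter logistic ICC; D = 1.7 is an assumed scaling constant."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def rescale(theta, a, b, slope, intercept):
    """Move ability and item parameters to the metric slope*theta + intercept."""
    return slope * theta + intercept, a / slope, slope * b + intercept

# Hypothetical examinee and item on the (mean 0, SD 1) metric.
theta, a, b, c = 0.5, 1.2, -0.3, 0.2

# Move to the (mean 50, SD 10) metric used in the example above.
theta2, a2, b2 = rescale(theta, a, b, slope=10.0, intercept=50.0)

p_old = icc_3pl(theta, a, b, c)
p_new = icc_3pl(theta2, a2, b2, c)  # c is not transformed

print(theta2, a2, b2)
print(abs(p_old - p_new) < 1e-9)    # probability of a correct response is unchanged
```

The probability is unchanged because the ICC depends on ability and
the item parameters only through the product of a and the difference
between theta and b, and that product is the same on both metrics.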

Item linking in IRT models consists of finding two common values
(i.e., ability mean and standard deviation or item parameter means)
in different sets of items given to different groups of people and 
then of determining a linear transformation that equates these values 
as well as the remaining two values which are determined by them. 
In the methods discussed in the next paragraphs, different sets of 
assumptions necessary to match values will be presented. The differ- 
ences between the methods are in the groups chosen as the reference 
groups and in the parameters matched. The concept of the linear
transformation to equate item parameters is the same for all methods.

Methods based on sampling. In the equivalent-groups method of
item linking, a sample of examinees available for item calibration is
randomly split into two or more groups, and each group is given a 
different set of items. It is assumed that the distributions of 
abilities are equal in the various groups; ability mean and standard 
deviation are the values matched across groups in this method.
Parameters a, b, and c are estimated separately in each group, abilities
are estimated, and ability levels and item parameters are simultaneously
transformed such that the ability means and standard deviations
of the groups are equal. The mean and standard deviation (i.e.,
origin and unit) of ability are arbitrary when items are calibrated
and must be set to some values. Calibration programs (e.g., LOGIST
or OGIVIA) typically set them to zero and one, respectively. In the
equivalent-groups method of linking, which assumes equal ability
distributions, setting means and standard deviations equal, as is 
done by the program, puts all parameters on a common metric. 

The equivalent-tests method allows an item pool to be divided
randomly into sets of items and these sets of items administered to
different groups of examinees. It is assumed that the item subpools
are equivalent, and thus the method derives from the concept of
randomly parallel tests. Item parameter means are the values matched
across groups, and no assumption is required about the distribution
of abilities in the samples of examinees. As in the equivalent-groups
method, parameters a, b, and c, as well as abilities, are estimated
separately in each group. The difference is that the ability
estimates and the a and b parameters are simultaneously adjusted such
that the item parameter means, rather than the ability mean and
standard deviation, are constant across groups (e.g., mean a of 1.0 and
mean b of 0.0). Theoretically, the c parameter does not change across
groups.
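
A minimal sketch of the equivalent-tests adjustment follows. The
function name, the target values, and the parameter estimates are
hypothetical; the sketch only shows how one calibration could be
rescaled so that its mean a and mean b match chosen constants, with
abilities carried along by the same transformation.

```python
import statistics as st

def link_by_item_means(a, b, theta, target_a=1.0, target_b=0.0):
    """Rescale one calibration so that mean a = target_a and mean b = target_b."""
    k = st.mean(a) / target_a        # unit: a is divided by k
    m = target_b - k * st.mean(b)    # origin: b becomes k*b + m
    new_a = [ai / k for ai in a]
    new_b = [k * bi + m for bi in b]
    new_theta = [k * t + m for t in theta]
    return new_a, new_b, new_theta

# Hypothetical estimates from one item subpool and its examinee group.
a = [0.8, 1.2, 1.6]
b = [-1.0, 0.2, 1.4]
theta = [-0.5, 0.0, 0.5]

new_a, new_b, new_theta = link_by_item_means(a, b, theta)
print(round(st.mean(new_a), 6), round(abs(st.mean(new_b)), 6))
```

Applying the same function to every subpool places all items on one
metric, provided the subpools really are equivalent in their a and b
means.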

Methods based on anchoring. In the anchor-group method, a
common group (i.e., anchor group) of individuals takes all items in 
the pool. Each subset of items is administered to a calibration 
group consisting of the anchor group and an additional group of 
examinees. The distribution of ability in the anchor group is taken 
as a standard, and no assumption of randomly sampled examinees or 
items is required. This method is conceptually very similar to the 
equivalent-groups method. Items are calibrated independently in each 
of the calibration groups as in the equivalent-groups method. The 
difference lies in the group of examinees on which the origin and unit 
of ability are established. In the equivalent-groups method, the 
mean and standard deviation of ability are assumed constant across 
calibration groups so the mean and standard deviation of ability in 
each of the groups is set to the same value. In the anchor-group
method, only ability in the anchor group is constant across
calibration groups so, within each calibration group, a linear
transformation of the item parameters is found which makes the ability
estimate means and standard deviations within the anchor group constant
across calibration groups (e.g., 0.0 and 1.0).
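
The anchor-group adjustment can be sketched as follows. The function
name, the membership flags, and the estimates are hypothetical, and
the population form of the standard deviation is an arbitrary choice
made here. The unit and origin are computed from the anchor-group
members only and are then applied to every examinee and item in the
calibration group.

```python
import statistics as st

def anchor_group_link(theta, is_anchor, a, b, target_mean=0.0, target_sd=1.0):
    """Rescale a calibration so its anchor-group abilities have the target mean and SD."""
    anchor_theta = [t for t, flag in zip(theta, is_anchor) if flag]
    k = target_sd / st.pstdev(anchor_theta)       # unit, from anchor members only
    m = target_mean - k * st.mean(anchor_theta)   # origin, from anchor members only
    new_theta = [k * t + m for t in theta]
    new_a = [ai / k for ai in a]
    new_b = [k * bi + m for bi in b]
    return new_theta, new_a, new_b

# Hypothetical calibration group: the first four examinees are anchor-group members.
theta = [-0.2, 0.4, 1.0, 1.6, -1.5, -0.9, 0.1]
is_anchor = [True, True, True, True, False, False, False]
a = [0.8, 1.1, 1.4]
b = [-0.5, 0.3, 1.2]

new_theta, new_a, new_b = anchor_group_link(theta, is_anchor, a, b)
anchor_after = [t for t, flag in zip(new_theta, is_anchor) if flag]
print(st.mean(anchor_after), st.pstdev(anchor_after))
```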

The anchor-test method is based on a common set of items admin- 
istered to all examinees. The anchor items are taken as the stand- 
ard against which all other sets of items are calibrated. Parameters 
of the anchor test items are first estimated on the entire sample 
from the population of examinees. The mean and standard deviation of 
ability in this sample can arbitrarily be set to zero and one, res- 
pectively. Then for each subset of non-anchor test items given to a 
subgroup of examinees from the available population, item parameters
and abilities are estimated. Each examinee in a subgroup will have 
an ability estimate from the anchor test items and another ability 
estimate from the non-anchor test items. Since the metric of the 
anchor test items is the standard, a transformation of item param- 
eters of the non-anchor test items must be found which will make 
ability estimate means and standard deviations equal for both anchor 
and non-anchor test items. As was the case with the anchor-group 
method, no assumptions regarding the distribution of item parameters 
or abilities are required. 
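
For the anchor-test method, the transformation is found from the two
ability estimates each examinee receives. The sketch below is a
hypothetical illustration with made-up, noise-free values; the
function name is an assumption. The unit and origin are chosen so
that the non-anchor ability estimates take on the mean and standard
deviation of the anchor-test estimates, and the same constants are
then applied to the non-anchor item parameters.

```python
import statistics as st

def anchor_test_link(theta_anchor, theta_new, a_new, b_new):
    """Put non-anchor item parameters on the anchor-test metric.

    theta_anchor and theta_new hold the same examinees' ability estimates,
    in the same order, from the anchor items and from the non-anchor items.
    """
    k = st.stdev(theta_anchor) / st.stdev(theta_new)     # unit
    m = st.mean(theta_anchor) - k * st.mean(theta_new)   # origin
    return k, m, [ai / k for ai in a_new], [k * bi + m for bi in b_new]

# Hypothetical estimates; here theta_anchor equals 2 * theta_new + 0.8 exactly.
theta_anchor = [-1.0, -0.2, 0.3, 0.9, 1.5]
theta_new = [-0.9, -0.5, -0.25, 0.05, 0.35]
a_new = [1.6, 2.4]
b_new = [-0.4, 0.1]

k, m, a_linked, b_linked = anchor_test_link(theta_anchor, theta_new, a_new, b_new)
print(k, m)
print(a_linked, b_linked)
```

With real estimates the two sets of abilities would not be perfectly
related, and the constants would carry estimation error.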

Composite network methods. The term network linking will be
used to refer to any linking paradigm in which one of the anchor
methods discussed above is used to simultaneously link items from 
more than two tests. Included in this category are the cascading 
schemes discussed by Angoff (1971) as well as the more complex net- 
works described by Wright (1977) and Forster and Ingebo (1979). Con- 
ceptually, network procedures accomplish the same thing as the simple 
methods discussed above. They also provide advantages not available 
in the simple methods, however. Cascading schemes allow more effi- 
cient use of subjects when abilities are spread over a wide range. 
The more complex networks allow this and additionally allow inde- 
pendent checks on the links and evaluation of linking adequacy. 

Criteria of Linking Adequacy 

Item linking and item calibration are two psychometric activi- 
ties that are intimately interrelated in practice. They are con- 
ceptually, however, two distinct operations, and it is important 
to recognize this fact when evaluating criteria for the adequacy 
with which each is done. Adequacy of calibration is evaluated by de- 
termining the accuracy with which the parameters of the items are es- 
timated. The essence of IRT linking, however, is embodied in the 
linear transformation used to put items onto a common metric. This 
transformation is specified by two parameters: unit and origin. It 
is thus the accuracy with which these two parameters are estimated 
that determines the adequacy of the link. Estimates of the two 
parameters are subject to the same estimation quality criteria
discussed above in reference to the item parameters: unbiasedness,
efficiency, sufficiency, and consistency.

Few of the studies discussed below have given adequate thought 
to the criteria of linking effectiveness. In most cases, linking and
calibration effects have been hopelessly confounded. In some studies 
of linking, no criteria that adequately reflect linking adequacy have 
been included. These deficiencies will be pointed out as the studies 
are discussed. More appropriate criteria will be presented later in
this report. 

Rasch model. Of all the IRT models, the Rasch model is by far
the simplest. It is a special case of the three-parameter logistic 
model which specifies that items can differ only in terms of diffi- 
culty. Graphically, this means that each ICC has the same slope but 
a different position to the right or left on the theta continuum. 
Although not a model of prime interest to the current research, be- 
cause it fails to consider that guessing is possible in multiple- 
choice tests, most of the recent studies of linking and equating have 
been done using the Rasch procedure. A representative sample of these 
studies is thus reviewed below. 

As in other logistic models, the Rasch ability parameters and 
item difficulty parameters (the only parameters in the Rasch model) 
are expressed on a common scale. Lack of an item discrimination 
parameter puts an additional restriction on the model in calibration: 
all items must be equally discriminating. In typical formulations 
of the model, the effective value of the common a parameter is 1/1.7
or about .59. If the actual value (in the logistic model frame of
reference) is .59, the ability distribution will have a variance of
1.0. If the actual value is anything else, the variance will be 
other than 1.0. Similarly, if the average person ability is equal 
to the average item difficulty, or item easiness in Rasch
terminology, the mean of the ability distribution (in the logistic frame
of reference) will be 0.0. 

Linking, as is commonly done with the Rasch model, consists of 
determining an additive constant to adjust both item easiness and 
ability values to a scale having a common origin. This is typically 
done in one of two ways. The first method requires that a common 
group of examinees respond to the item sets to be equated. Since
the ability of the sample of persons is the same in both item sets, 
any differences in average ability computed from the different item 
sets are due to differences between the item sets. The second method 
requires that two groups of examinees respond to two item sets which 
share a common subset of items. In this method, the model states 
that because the common core of items should have the same average 
item easiness in both sets, any observed difference is due to differ- 
ences in ability levels of the two groups in which the two sets of 
items are calibrated. An adjustment making the item easiness equal 
in the core items can be applied to the non-core items to place them 
onto the common scale. 
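
The second (common-item) method reduces to a single additive constant.
The following sketch uses hypothetical item labels and values; the
values may be read as easiness or difficulty provided both
calibrations use the same convention. The shift is the difference in
the mean of the common items between the two calibrations, and it is
applied to every item in the current calibration.

```python
import statistics as st

def rasch_shift(common_ref, common_cur):
    """Additive constant moving current-calibration values onto the reference scale."""
    return st.mean(common_ref) - st.mean(common_cur)

# Hypothetical item values; i1 to i3 are the common (core) items.
ref_common = {"i1": -0.40, "i2": 0.10, "i3": 0.90}
cur_items = {"i1": -0.90, "i2": -0.35, "i3": 0.35, "i4": 0.20, "i5": 1.10}

common = sorted(set(ref_common) & set(cur_items))
shift = rasch_shift([ref_common[i] for i in common],
                    [cur_items[i] for i in common])

# The same constant is applied to core and non-core items alike.
linked = {item: value + shift for item, value in cur_items.items()}
print(round(shift, 2), round(linked["i4"], 2))
```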

In order for linking to be possible in this simple form, the 
discriminating powers of the items must be constant not only within 
tests but also across tests. Otherwise, only the means of the tests 
would be equated and not the variances. Most of the studies in- 
volving the Rasch model make the assumption of equal item discrimin- 
ations across tests. 

Several recent studies have investigated the utility of the
Rasch model for the equating/linking of the National Board Medical 
Examinations. Bell (1979) used an anchor test to equate a 225-item 
Physician's Assistants Examination given in 1978 with a similar ver- 
sion given in 1976 (referred to here as the reference test). The 
anchor test was a 46-item set that had been included in all
Physician's Assistants Examinations given since the testing program was
begun. Bell evaluated two procedures in terms of their ability to 
answer two questions: 

1. Is the ability level of current examinees higher than the 
reference group on which the reference test was originally 
calibrated? 

2. Are the items on the current test more difficult than 
those on the reference test? 

The procedures Bell compared were the Rasch model and several 
variants of linear raw score equating. For the Rasch procedure, each 
examination was calibrated separately. This yielded easiness param- 
eters for each item set and ability estimates for each examinee group. 
Using a shift constant computed from the 46-item anchor test, ability 
scores from the current test were shifted to the scale of the reference 
test. The linear raw-score equating procedure began by estimating the 
mean and variance for both tests from the performances of the current 
group and the reference group on their respective tests and the com- 
bined (current and reference) group on the common items. These esti- 
mates were then used in a linear equation to yield a raw-score conver- 
sion. This procedure was not specified in detail but reference was 
made to Angoff's (1971) equating procedure for groups not widely dif- 
ferent in ability. Bell concluded that although each procedure was 
capable of answering the question about the ability level of the cur- 
rent examinee group, only the Rasch model answered the question about 
whether the difficulty of the current items had increased. No
discussion was given as to the fit of the data to the Rasch model so
judgment of the accuracy of the equating cannot be made. Due to the 
brevity of the paper, no more detailed inferences can be drawn. 

Kelly (1979) discussed a large Rasch linking study in which items
from two forms of a 1,000-item examination were linked together onto a
common scale. The tests used, licensing examinations for medical
doctors, were each composed of seven subtests of approximately equal
length, assessing areas as diverse in content as biochemistry and
behavioral science. Kelly made the assumption that these subtests all
measured knowledge of medical science and were unidimensional enough
in total to allow Rasch calibration. Statistical tests of this
assumption, not described in enough detail to evaluate, reportedly
supported its tenability.




Kelly described two studies. In the first, the seven subtests 
of a reference form of the test were administered to approximately 
8,500 second-year medical students. Items in this test were all put 
onto a common scale by shifting subtest difficulty by an amount 
necessary to make ability estimate means zero for each of the sub- 
tests. The implicit assumption of equal item discrimination among 
subtests was apparently not tested. A second form of the test, the 
current form to be linked to the reference form, was given to ap- 
proximately 3,000 second-year medical students. There were an un- 
specified number of common items between corresponding subtests in the 
two test forms. The linkage between the forms was established by 
first calibrating items of each subtest in the current form in the 
current group and then setting mean difficulties of the common items 
within subtests equal across the two forms. Uncommon items in the 
current test were put onto the reference test metric by adjusting them 
using the constant used to adjust the common items in the
corresponding subtest. This resulted, given the assumptions, in a pool
of 2,000 items all linked onto a common scale.

In the second study that Kelly described, both the reference test 
and the current test were first calibrated separately as 1,000-item 
homogeneous tests. Linking was accomplished by finding the constant 
that adjusted the common items to have equal mean difficulties in the 
two examinee groups. This was done in the same manner used for the 
subtests earlier. The difference here was that the entire test was 
linked at one time. This study was primarily descriptive rather than 
evaluative and, as such, provided no information on comparisons of
linking designs. It did, however, illustrate two different designs. 
In the first study, linking was accomplished using a degenerate case 
of the equivalent-groups method (in which the groups were identical) 
and the anchor-test method. The second study used the anchor-test
method exclusively. 

The major flaw in Kelly's study is that it was purely descriptive 
rather than evaluative. It would have been informative, for example, 
to have a comparison of the two equating procedures using the same 
data. It seems reasonable to assume that both procedures would yield 
nearly the same results, but an empirical validation would be more 
convincing. 

In the third study, sponsored by the National Board of Medical 
Examiners, Hughes (1979) used data from six tests given to different
groups of examinees and placed the tests onto a common scale. Each 
test was composed of either 10 or 11 sets of six multiple-choice ques- 
tions for a specific physician-patient interaction. The common-item 
links were thus composed of sets of questions, an arrangement that
probably violated the local independence assumption of IRT.

The procedure for linking the six tests consisted of a complex
network of common-item links. An iterative procedure computed
estimates of each test's average difficulty on a common scale and
expected values of the shift constant for tests having no common-item
link. Two indices were proposed to identify inconsistent triads and
links: a triad index and a link index. No information was provided
about the distribution of these indices. Thus, only relative
statements about the quality of the linking networks could be made.
Although no conclusions were stated, use of the link and triad
indices as diagnostic tools in evaluating the quality of Rasch linking
was suggested.
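
The indices Hughes proposed are not defined in enough detail here to
reproduce. The sketch below shows only the generic loop-closure idea
that underlies any triad check in a linking network, and it should
not be read as Hughes's index: if three tests are linked pairwise by
shift constants, the shifts taken around the closed loop should sum to
zero when the links are consistent. The test labels and shift values
are hypothetical.

```python
def triad_closure(shift_ab, shift_bc, shift_ca):
    """Sum of shift constants around the loop A -> B -> C -> A; zero if consistent."""
    return shift_ab + shift_bc + shift_ca

# Hypothetical shift constants estimated from three separate common-item links.
links = {("A", "B"): 0.42, ("B", "C"): -0.15, ("C", "A"): -0.30}

residual = triad_closure(links[("A", "B")], links[("B", "C")], links[("C", "A")])
print(round(residual, 2))  # departure from zero signals an inconsistent triad
```

Without a sampling distribution for such a residual, it can only be
compared with the residuals of other triads, which is the limitation
noted above.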

Rentz and Bashaw (1975, 1977) applied item analysis and scaling
methods of the Rasch model to data from the equating phase of the
Anchor Test Study (Loret, Seder, Bianchini, & Vale, 1974) in the
development of the National Reference Scale (NRS) for reading. The
NRS was developed from seven widely used standardized reading tests
consisting of vocabulary and comprehension subtests. There were 
two forms of each test, a primary and an alternate form. All 14 
tests were chosen to be appropriate for grades 4, 5, and 6. 

Seven pairs of tests were studied at each of the three grade 
levels. Each examinee responded to two reading tests. Each pair 
of tests was administered, counterbalanced, to two separate samples 
within each grade level yielding a total of 42 samples per grade 
level. In addition, each test was paired with its alternate form, 
counterbalanced within each grade level, and administered to 14 
additional samples. 

All tests at a single grade level were placed onto a common 
scale. Within each grade level, test pairs were calibrated as a 
single long test. The average item easiness was computed for each 
single test and the differences in averages were then computed for 
the test pair. These average differences were organized into 
matrices such that the lower half of the matrix contained differences 
from one order of testing and the upper half of the matrix, from the 
second order of testing. Row and column means were averaged,
reversing the signs of the row means (due to reversed orders of
administration), to obtain the equating constant averaged over order of
administration. Tests were then placed onto a common scale defined
by the Sequential Tests of Educational Progress, Series II (STEP-II),
which was administered to all grade levels.

Comparisons of equated raw scores (i.e., number correct with no 
correction for guessing) from the Anchor Test Study and the Rasch
study were made across samples from each study that took the same 
tests in the same order. For each comparison, the first test admin- 
istered was taken as the base test. Conditional mean-squared errors 
were then computed for each base test score. For the comparisons
reported, the differences between the equipercentile and the
Rasch-based equated scores ranged from 0 to 3 raw-score points and were
deemed inconsequential.

Slinde and Linn (1978, 1979) presented a set of studies designed
to evaluate the adequacy of the Rasch model for vertical equating
(i.e., equating where tests differ widely in difficulty and examinees
differ widely in ability). In their first study (Slinde & Linn,
1978), response data from 1,365 examinees on a 36-item mathematics
achievement test were used. Two tests of differing difficulty were
obtained by dividing the 36-item test into two 18-item tests on the
basis of the p-values of the items obtained in the group of 1,365
examinees. The average p-values of the tests were .665 for the easy
test and .362 for the difficult test. The examinees were then divided
into low-, middle-, and high-ability groups on the basis of their
scores on the easy test.

Rasch item parameters were calculated for the total set of 36 
items in the low group, the high group, and the total group (the
middle group was reserved for later use). Ability estimates were
then calculated for each of these groups (low, high, and total) using 
parameters obtained from each group in a crossed design. Mean dif- 
ferences between ability estimates derived from the easy test and 
the difficult test were then computed and compared. 

When the total group ability estimates were calculated using 
item parameters obtained from the total group, the difference be- 
tween means obtained from the easy and difficult tests was trivial. 
Similarly, when the high group mean was calculated using item param- 
eters obtained from the high group and when the low group mean was cal- 
culated using the item parameters obtained from the low group, the 
differences were trivial. When items calibrated in the high group 
were used to estimate abilities in the low group or the middle group 
and when items calibrated in the low group were used to estimate 
abilities in the high group or the middle group, substantial differ- 
ences in ability estimate means were found. Slinde and Linn inter- 
preted this to mean that Rasch parameters were not really invariant 
and that Rasch equating procedures were not particularly useful for 
the problem of vertical equating. 

Gustafsson (1979) criticized this interpretation. He suspected 
that the differences between means were due to regression artifacts
which were due to the fact that Slinde and Linn had estimated abil- 
ities and subgrouped people on the basis of only 18 of their 36 
items. Individuals would not be expected to perform, in a relative 
sense, as extremely in either direction on the entire 36 items as
they did on the easy 18; therefore, a difference between means would 
be expected. To support his hypothesis, Gustafsson performed a com- 
puter simulation modeled closely after the Slinde and Linn study with 
the notable exception that the assumed invariance properties of the 
Rasch model were built in. His simulation showed that the parameter
estimates obtained in the different groups were different but that 
this was due to a regression artifact and not to a lack of invariance. 
He suggested that Slinde and Linn reanalyze their data, subgrouping
individuals on the basis of their total test scores. 

Slinde and Linn (1979) improved upon this idea by obtaining data 
from 1,638 examinees on two different tests including a 60-item
reading comprehension test. The first test was used to independently
subgroup examinees. The 60-item test was then split, on the basis 
of item difficulty, into two 30-item tests and their original study
was essentially replicated. Their findings were that the mean
differences disappeared in comparisons of the middle with the high 
group. Whenever the low group was compared with another group, the 
differences persisted. This finding was attributed to the effects 
of guessing. No allowance is made by the Rasch model for the possi- 
bility that correct responses can be obtained through guessing. When 
multiple-choice items are used, as was the case here, guessing undoubt- 
edly happens and probably tends to bias the results. Most likely this 
was a more pronounced effect for the low ability group where subjects 
knew the correct answer less often and had more "opportunity" to guess. 

Together these studies suggest that linear equating works as 
expected using the Rasch model but that problems may result if the
model is used in groups of sufficiently low ability that guessing 
occurs with any frequency. Unfortunately, most items used in ob- 
jective tests can be answered correctly by guessing and may often be 
used in environments where guessing is likely to occur. The three- 
parameter logistic model extends the Rasch model to account for guess- 
ing and thus may be more generally useful. 

Three-parameter logistic model. In the three-parameter logistic
model, as in the simple Rasch model, a linear equation is used to 
link parameters on one test to those on another. The one difference 
in the three-parameter case is the explicit addition of a scaling 
parameter to adjust for changes in unit as well as origin. 

Three studies of linking using the three-parameter logistic 
model were of direct relevance to the present effort. One, a study 
by Reckase (1979), was of interest for two reasons: first, he presented
four methods of determining the linking transformation, and
second, he attempted to determine acceptable numbers of items to be 
included in anchor tests for adequate linking to be possible. The 
four techniques for item linking he presented were: (a) major axis, 
(b) least squares, (c) least squares with outliers deleted, and (d) 
maximum likelihood. 

The major-axis technique got its name from the fact that the 
parameter transformation equation was derived from the equation for 
the major axis of the ellipse formed by the data points of a bi- 
variate plot of parameters of items in the tests being linked. In 
simpler terms, it amounted to a linear regression of the current pa- 
rameters onto the reference parameters assuming the correlation to be 
perfect. Adjustment was made for unit and origin but no actual
regression was performed.
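
The unit-and-origin adjustment just described can be sketched as
follows. This is an illustration under stated assumptions rather than
Reckase's implementation: the constants are computed here from the b
parameters of the common items, and the function names and parameter
values are hypothetical. The slope is the ratio of standard
deviations, with no correlation term, and the intercept aligns the
means.

```python
import statistics as st

def major_axis_constants(cur_b, ref_b):
    """Unit (slope) and origin (intercept) taking current b values to the reference metric."""
    slope = st.stdev(ref_b) / st.stdev(cur_b)
    intercept = st.mean(ref_b) - slope * st.mean(cur_b)
    return slope, intercept

def apply_link(a, b, slope, intercept):
    """Transform a and b parameters; c parameters are left unchanged."""
    return [ai / slope for ai in a], [slope * bi + intercept for bi in b]

# Hypothetical common-item b parameters from the two calibrations.
cur_b = [-1.0, 0.0, 0.5, 1.5]
ref_b = [-1.5, 0.5, 1.5, 3.5]   # equals 2 * cur_b + 0.5 in this made-up case

slope, intercept = major_axis_constants(cur_b, ref_b)

# Hypothetical unique items from the current calibration.
new_a, new_b = apply_link([1.0, 1.8], [-0.25, 0.75], slope, intercept)
print(slope, intercept)
print(new_a, new_b)
```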

The least-squares procedure was a regression procedure where the 
correlation was determined empirically rather than assumed to be per- 
fect. As discussed earlier, this is not a legitimate linking method 
but rather a method of prediction. 
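The contrast between the two procedures can be made concrete with a small sketch. It follows the description given above (major axis as a unit-and-origin adjustment with the correlation taken as perfect; least squares with the empirical correlation) and is not Reckase's code. The anchor-item difficulty values and function names are hypothetical.

```python
import math
import statistics as st

def pearson_r(x, y):
    """Product-moment correlation between two equal-length lists."""
    mx, my = st.mean(x), st.mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

def major_axis(current, reference):
    """Slope and intercept adjusting only for unit and origin (r assumed 1)."""
    slope = st.pstdev(reference) / st.pstdev(current)
    intercept = st.mean(reference) - slope * st.mean(current)
    return slope, intercept

def least_squares(current, reference):
    """Ordinary regression of reference on current (slope shrunk by r)."""
    slope = pearson_r(current, reference) * st.pstdev(reference) / st.pstdev(current)
    intercept = st.mean(reference) - slope * st.mean(current)
    return slope, intercept

# Hypothetical b estimates for five anchor items on each metric.
current_b = [-1.2, -0.5, 0.1, 0.6, 1.4]
reference_b = [-0.9, -0.1, 0.2, 1.1, 1.6]

ma_fwd = major_axis(current_b, reference_b)
ma_back = major_axis(reference_b, current_b)
ls_fwd = least_squares(current_b, reference_b)
ls_back = least_squares(reference_b, current_b)

print("major axis   slope product:", ma_fwd[0] * ma_back[0])
print("least squares slope product:", ls_fwd[0] * ls_back[0])
```

The major-axis slopes in the two directions are reciprocals, so linking test X to test Y and back returns the original values. The least-squares slopes multiply to the squared correlation, which is less than one whenever the estimates contain error; this asymmetry is the sense in which regression predicts rather than links.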

The least-squares-with-outliers-deleted procedure presented was 
the same as the least-squares procedure, but items with parameters 
further than two standard errors from the regression line were de- 
leted. Like the other least-squares procedure, this was not a legi- 
timate linking method. 

The maximum-likelihood procedure described by Reckase was really 
a version of the major-axis method. The procedure, as described, 
made use of the capability of the program LOGIST to treat items as 
"not reached" and ignore them in estimation of ability. What LOGIST 
actually does can best be illustrated in the simple paradigm in which 
two tests, with some of their items common, are given to two groups. 
For examinees taking the first test, items unique to the second are 
coded "not reached." For examinees taking the second test, items 
unique to the first are treated as "not reached." LOGIST estimates 
abilities for all examinees using all items "reached." This means 
that each examinee is scored on those items contained in the test 
taken. Using these ability estimates, item parameters are then esti- 
mated. Before the estimation process, which is iterative, can proceed 
to another stage, the ability estimates are scaled to a mean of zero 
and a variance of one. To do this, all item parameters must be appro- 
priately adjusted. The adjustment is a major-axis transformation de- 
signed to make the parameters of the common items equal and the over- 
all ability mean zero and variance one. Asymptotically, the same 
result should be achieved by an ordinary major-axis transformation 
following separate calibrations. For estimation, however, the maximum- 
likelihood procedure has the advantage of using all available data on 
the common items for each of the two separate calibrations. 

Reckase used live-testing data obtained from administration of 
the Iowa Test of Educational Development (ITED) given to 1,000 Iowa 
school students from each of grades 9, 10, 11, and 12. The ITED 
consisted of seven subtests with a total length of 357 items. A 
principal-components analysis produced a sufficiently strong first 
component to suggest unidimensionality. The data were calibrated 
using each of three programs: (a) a Rasch model program written by 
Wright and Panchapakesan (1969), (b) LOGIST, a three-parameter lo- 
gistic maximum-likelihood program (Wood, Wingersky, & Lord, 1975), 
and (c) ANCILLES, a three-parameter logistic minimum chi-square pro- 
gram (Urry, 1978). 

This study was designed to evaluate the joint effects of linking 
method, calibration procedure, sample size, and anchor test size. As 
was discussed earlier, the major-axis method of determining a trans- 
formation was the only true equating method presented and discussion 
will be limited to that method. Sample sizes were 100, 300, 500, 
1,000, and 2,000, obtained using a "systematic sampling procedure" from 
a total of 4,000 cases. Three levels of item overlap were chosen: 5, 
15, and 25 items. 

Four 50-item tests were linked in each condition. These tests 
were cascaded in the sense that, except for the first and last test, 
each test was linked to the previous test and the following test by 
two different sets of anchor items. Overlap among items in the two 
anchor sets in each test was permitted. Linking was performed se- 
quentially: the second test was linked to the first, the third test 
was linked to the first two, and the fourth test was linked to the 
first three. 

Each test was calibrated with each calibration program for each 
sample size, and each set of four tests was linked for each sample 
size and degree of overlap. Thus, for each linking there were 15 
combinations of sample size and common item overlap. The reference 
against which linking adequacy was judged was a full calibration of 
the entire 357-item test using the full sample. 

The adequacy of linking was evaluated in three ways: (a) cor- 
relations between the linked parameter values and the total-test-cali- 
bration parameter values, (b) a sum-of-squared-deviations quality-of- 
linking index (Wright, 19--), and (c) scatterplots of linked parameter 
values versus total-test-calibration parameter values. 

Results of the correlational analysis for the Rasch linking 
showed a predictable pattern of increasing correlations as sample 
size and number of overlapping items increased. No statistically 
significant changes in correlation occurred as the number of tests 
linked increased, but significance would have been difficult to judge 
because all correlations were near 1.0. The sum-of-squared-deviations 
quality-of-linking index was computed and reported for the Rasch model, 
but because the chi-square values (a transformation of this index) 
were significant, even when the correlations were of the order of .999, 
Reckase concluded that this index bore little relationship to the qual- 
ity of linking. Therefore, this quality-of-linking index was not re- 
ported for the three-parameter models. 

For the three-parameter calibration models, the correlations 
tended to follow the same increasing trend as sample size increased. 
No data were available for the 5- or 25-item overlap combinations; 
therefore, no conclusions could be drawn regarding trends with in- 
creasing item overlap. From the correlational data reported, there 
seemed to be evidence to indicate that ANCILLES performed substan- 
tially better than LOGIST. 

One problem is apparent in this study. Linking in an IRT model 
is an attempt to make a linear transformation of parameters from one 
metric to another. Correlations, the major criteria used in this 
study, are insensitive to differences between linear transformations. 
Although they provide information about the accuracy of calibration, 
they say virtually nothing about the adequacy of linking. The one 
criterion that is related to linking quality, squared error of esti- 
mate, was eliminated from consideration because it showed a difference 
where the correlations showed none. 
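The point about correlations can be checked numerically. In the sketch below, all parameter values are hypothetical: one set of estimates sits close to the true values, and a second set is the same estimates pushed through a deliberately wrong linear transformation. The correlation with the true values is identical in both cases, while the average absolute difference separates them clearly.

```python
import math
import statistics as st

def pearson_r(x, y):
    """Product-moment correlation between two equal-length lists."""
    mx, my = st.mean(x), st.mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

def mean_abs_diff(x, y):
    """Average absolute difference between paired values."""
    return sum(abs(xi - yi) for xi, yi in zip(x, y)) / len(x)

# Hypothetical true difficulties and well-linked estimates.
true_b = [-1.5, -0.8, -0.2, 0.4, 1.1, 1.9]
well_linked = [-1.4, -0.9, -0.1, 0.5, 1.0, 2.0]

# The same estimates after an incorrect link (wrong unit and origin).
mislinked = [1.5 * b + 0.7 for b in well_linked]

r_good = pearson_r(true_b, well_linked)
r_bad = pearson_r(true_b, mislinked)
mad_good = mean_abs_diff(true_b, well_linked)
mad_bad = mean_abs_diff(true_b, mislinked)

print("correlations:", r_good, r_bad)
print("mean absolute differences:", mad_good, mad_bad)
```

Any criterion expressed in the units of the metric (squared or absolute error) responds to a faulty link; a correlation cannot.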

As the data for the three-parameter model were not complete at 
the time the report was written, the effects of item overlap could 
not be evaluated. Furthermore, as only one linking paradigm was pre- 
sented (i.e., an anchor test design) no comparisons among methods 
were possible. Thus, the study served to clarify some issues re- 
garding methods of transformation but did not provide any hard em- 
pirical data regarding linking design for the three-parameter model. 

Ree and Jensen (1980), in a simulation study, investigated the 
joint effects of varying calibration group sample size and linking 
group sample size on the quality of the item parameter estimates. 
Simulating two tests with common items, a pool of 140 hypothetical 
items was specified. This pool was split into two tests of 80 items 
each. Twenty of the items were common to the two tests. The first 
test, T1, was taken as the reference test and the second test, T2, 
as the current test. Although not stated in the report, the pro- 
gram OGIVIA was used for calibration (Ree, 1980a). 

Two groups of 2,000 hypothetical examinees each were generated 
from a standard normal population and a response vector for each 
examinee on one of the two tests was generated according to the three- 
parameter logistic model. Four samples of size 250, 500, 1,000, and 
2,000 were drawn with replacement from each group and were used to 
calibrate the corresponding test. The major axis method of linking, 
described earlier, was then used to link parameters of the current 
test to the metric of the reference test. 
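The data-generation step in a study of this kind can be sketched as follows. This is an assumed, minimal version and not the program used by Ree and Jensen: the three item parameter triples, the random seed, and the scaling constant D = 1.7 are hypothetical. Abilities are drawn from a standard normal population, and each response is scored correct when a uniform random number falls below the model probability.

```python
import math
import random

def p3pl(theta, a, b, c, D=1.7):
    """Three-parameter logistic probability of a correct response.
    D = 1.7 is an assumed scaling constant."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def simulate_responses(thetas, items, rng):
    """Return one 0/1 response vector per examinee."""
    return [
        [1 if rng.random() < p3pl(theta, a, b, c) else 0 for (a, b, c) in items]
        for theta in thetas
    ]

rng = random.Random(1980)  # fixed seed so the run is repeatable

# 2,000 simulated examinees from a standard normal ability population.
thetas = [rng.gauss(0.0, 1.0) for _ in range(2000)]

# Hypothetical (a, b, c) triples for a three-item test.
items = [(1.6, 0.0, 0.25), (1.0, -1.0, 0.25), (2.0, 1.0, 0.25)]

responses = simulate_responses(thetas, items, rng)
proportion_correct = [
    sum(row[j] for row in responses) / len(responses) for j in range(len(items))
]
print(proportion_correct)
```

Because the generating parameters are known, any later calibration or linking result can be compared directly with the truth, which is the property the real-data study lacked.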

Two criteria were considered in evaluating the quality of the 
parameter estimates. They were the correlations between true and 
estimated item parameters and the average absolute differences be- 
tween true and estimated parameters. In the portion of the study 
explicitly discussing linking, only the average absolute differences 
were presented, as correlations were expected to be misleading. 

Both criteria behaved as might be expected from other research 
when accuracy of calibration was investigated separately in the two 
tests. Correlations for the a and b parameters increased and average 
absolute error decreased as sample size increased. No definite trend 
was obvious for the c parameter, however. It was estimated relatively 
poorly at all sample sizes but some improvement was noticeable as the 
sample size rose to 2,000. 

Linking adequacy was investigated at each of 16 combinations 
of reference and current group sample size for the a and b param- 
eters. The c parameter, not in need of linking, was not considered. 
The expected trend toward decreasing error in the current test with 
increasing sample size was observed, for the most part, in the b pa- 
rameters. As the size of the current test calibration sample in- 
creased, error in the b parameters decreased. There was a reversal 
with respect to the sample size used in calibrating the reference 
test: errors of estimation for the current-test b parameters were 
less for reference test calibration samples of 500 than for 1,000. 

Errors in estimating a parameters did not follow such a reason- 
able pattern. Errors, as a function of reference test calibration 
group size, typically decreased with increasing size. Errors, as a 
function of current group size, were highest at a sample size of 250, 
lowest at a sample size of 500, and increasing from 500 to 2,000. It 
is this latter trend that was not expected. 

An interesting comparison present in the data but not discussed 
was the relative quality of linking available from assuming equiva- 
lent groups of individuals when such an assumption is warranted (as 
it was in this study) compared to the quality of linking obtained 
from use of an anchor test. Since the calibration program assumed 
the ability metrics were the same for the two groups, the items were 
automatically linked upon calibration. Errors incurred in this link- 
ing were presented in the last column of Ree and Jensen's Table 5. 
When these results are compared to those obtained using the anchor 
test presented in their Table 6, it can be seen that the anchor test 
method was superior in only three of 16 sample size combinations for 
the a parameters and never superior for the b parameters. Thus, it 
appears, an explicit attempt to link items is not always necessary 
or desirable. 

The third study of consequence to the present effort was a 
unique application of the three-parameter latent trait model by 
Sympson (1979). The procedure for placing items onto a common scale 
was unique in that it required neither overlapping groups of exam- 
inees nor overlapping sets of items. The data collection plan is 
schematically shown in Figure 2. Items were rank ordered in terms 
of difficulty and subtests were formed ranging from easy to diffi- 
cult. Each subtest was administered to examinees at the grade level 
for which it was targeted and at the grade levels one level above 
and one level below that. Subtests were calibrated using responses 
of the three groups who took each subtest. 

Figure 2. Sympson's Data Collection Plan 

In order to place each subtest onto a common scale when there 
are no common items or common persons, Sympson suggested that if 
groups are randomly sampled from their respective populations, an 
equivalent-groups condition exists. This is indicated by the dashed 
box in Figure 2. The assumption of random sampling from a specified 
population implies, for example, that the group formed by combining 
individuals from levels 3 and 4 who took subset B was a random sample 
from the same "composite" population as the group formed by combining 
individuals from levels 3 and 4 who took subset C. Each pair of groups 
sampled from a common composite population was assumed to have the 
same mean and standard deviation on the underlying ability metric and 
thus comprised equivalent groups. 

The paper was simply descriptive of the method and presented no 
data suggesting how well it worked. Reference was made to an unpub- 
lished simulation which apparently yielded favorable results. The 
paper's primary contribution to the current research is in its sugges- 
tion of a rather creative composite of simple procedures. 



Conclusions 

The research reviewed has been useful in suggesting potential 
methods of performing the act of item linking. Several data 
collection designs were suggested. Several methods of establishing 
the transformations were also suggested and served to clarify the 
fact that, for IRT models, only the major-axis procedure is appro- 
priate. Finally, the studies reviewed suggested several criteria of 
linking adequacy. They served primarily to suggest a distinction 
between criteria of calibration and of linking adequacy and to suggest 
some candidates for linking-quality criteria. 

The studies to date have not, singly or collectively, adequately 
dealt with the linking problem in general, however. Reckase (1979) 
attempted to compare methods of linking but his comparisons were 
primarily between transformation techniques not appropriate for link- 
ing. Ree & Jensen (1980) provided data relevant to the comparison of 
two data collection designs but the study was too small in scope to 
furnish much information regarding the linking problem in general. 
The remainder of the studies reviewed were primarily reports of how 
linking or equating had been accomplished for an applied problem and 
provided little insight into the general linking problem. The need 
for a broad investigation into the general linking problem seems 
obvious if linking is to be done accurately and efficiently. 

The preceding discussion on the need to evaluate calibration and 
linking effectiveness separately was not intended to mean that cali- 
bration and linking are independent activities. The accuracy with 
which items are calibrated will have a definite effect on the accur- 
acy with which items are linked. If, due to poor calibration, the 
ability levels of the groups are not accurately assessed, the trans- 
formation linking two groups will be in error. Similarly, the accur- 
acy with which items are calibrated is, to some extent, dependent on 
the linking paradigm used. 

It is thus important in a study of linking effectiveness to eval- 
uate not only the adequacy of the link but also the adequacy of item 
calibration under the various paradigms. Ultimately, it is the accu- 
racy with which the common-metric item parameters are estimated that 
will determine the quality of the tests resulting from these items, 
and this accuracy should be evaluated. Causes of inaccuracy in these 
parameters must, however, be evaluated by partitioning them into the 
effects due to calibration and the effects due to linking. 



II. BASIC RESEARCH DESIGN 



There are three general approaches to evaluating competing stat- 
istical or psychometric methods such as those considered by this 
project: a theoretical study, a real-data study, and a Monte-Carlo 
computer simulation (Weiss & Betz, 1973). In a theoretical study, 
a statistician or psychometrician, working from a basic statistical 
model, analytically derives the relevant characteristics of the 
various methods and then compares them. An example of this method 
was given by Lord (1971) in which he analytically derived several 
psychometric characteristics of a testing strategy. The theoretical 
method provides exact answers to theoretical questions but is usually 
limited to simple comparisons and comparisons made simple by restric- 
tive assumptions. 

Real-data studies answer different kinds of questions than do 
theoretical studies. Rather than answering questions about psycho- 
metric comparisons, they answer questions regarding characteristics 
of people and interactions of people with testing methods. They, in 
themselves, cannot answer questions such as which method best recov- 
ers true parameters because, in real data, the true parameters are 
never known. They are, nevertheless, essential in determining char- 
acteristics to use in theoretical or simulation studies and as a 
verification of the results of such studies. 

A computer simulation is a modified theoretical study in which 
theory and data come together in a stochastic model simulating the 
responses of human examinees. Examples of a simulation study com- 
paring testing methods are provided by Vale and Weiss (1975, 1978). 
Examples of simulation studies comparing calibration techniques are 
provided by Ree (1978, 1979). The simulation method is often prefer- 
red to real-data studies because true parameter values are known and 
more information can be collected more quickly. It is often prefer- 
red over a theoretical study because less restrictive assumptions 
are required. The simulation method is only as good as the theory 
underlying it and the reality of the parameters behind it, however. 

To assure that the simulation results are meaningful, a simul- 
ation model must do two things: first, it must demonstrate a direct 
connection to the real-world problem that it simulates, and second, 
it must provide explicit answers to the questions of interest regard- 
ing the problem. The simulation models used in this project were 
anchored to the real world in two areas. First, the test items sim- 
ulated were defined to be similar (in terms of their item parameters) 
to Armed Services aptitude items likely to be encountered in an 
actual linking problem. Second, the populations of individuals taking 
the tests were defined to be similar in ability to populations likely 
to take Armed Services tests. These procedures are described in the 
first of two sections below. 

To address the research questions of interest adequately, the 
simulations and subsequent analyses must be properly designed and 
executed. In the second major section below, the research questions 
and the criteria used to evaluate the procedures are integrated into 
a concrete design for implementation of the study. 



Development of Simulation Models 

Specification of Items 

Analyses of ASVAB item parameters. Two distinct sets of item 
parameter data were available for evaluation in preparation for the 
computer simulations. The first of these was an OGIVIA-produced IRT 
parameter set obtained from the subtests of an experimental version 
of Armed Services Vocational Aptitude Battery (ASVAB) Form 8 adminis- 
tered to Armed Forces Examining and Entrance Station (AFEES) exam- 
inees; a sample of 500 examinees was used to obtain the IRT param- 
eters. Experimental Form 8 was a form of the ASVAB developed to 
parallel then-operational Form 7 (see Fruchter & Ree, 1977). The 
second set of data included the classical item parameters (i.e., the 
item-total score correlations and proportion correct) obtained from 
new Forms 8, 9, and 10 of the ASVAB, administered, in a previous pro- 
ject, to groups of high school juniors and seniors. Each form was 
given to approximately 500 examinees. These parameters were trans- 
formed to IRT a and b parameters using Urry's method of simple ap- 
proximation (Jensema, 1976). Because all items were four-alternative 
multiple-choice items, the c parameters were all set to .25. 

New ASVAB Forms 8, 9, and 10 differed from the old Forms 5, 6, and 
7 (and, hence, from Experimental Form 8 discussed above) in that three 
of the original 12 subtests were eliminated, two subtests were com- 
bined, and two new subtests were added. Thus, there remained seven 
subtests in common between the two sets of available data. One of 
these subtests, Numerical Operations, was a speeded test and was there- 
fore eliminated from consideration here because the logistic model is 
inappropriate for speeded tests. The six remaining subtests were Word 
Knowledge (WK), Arithmetic Reasoning (AR), Mathematics Knowledge (MK), 
Electronics Information (EI), Mechanical Comprehension (MC), and General 
Science (GS). In the new Forms 8 to 10, the lengths of five of these 
subtests were increased by 5 or 10 items; only the electronics test was 
shortened (by 10 items). See Table 1 for the numbers of items avail- 
able in each of these subtests. These six subtests formed the basis 
for comparisons between Experimental Form 8 and the new Forms 8 to 10. 

Table 2 presents summary statistics of items from the tests 
analyzed. The first four columns present values obtained for the 
first four central moments on the subtests of Experimental Form 8. 
The remaining four columns show values of the four moments obtained 
by pooling items from the new ASVAB Forms 8, 9, and 10. 

Table 1. Number of Items in the Two Sets of Item Parameter Data 

                                                  New Forms 8, 9, 10 
                                 Experimental   Within      Across All 
Subtest                          Form 8         One Form    Forms Available 

Word Knowledge (WK)                  30            35           175 
Arithmetic Reasoning (AR)            20            30           180 
Math Knowledge (MK)                  20            25            75 
Electronics Information (EI)         30            20            60 
Mechanical Comprehension (MC)        20            25            75 
General Science (GS)                 20            25            75 

Note: For WK and AR, a total of 6 different forms existed for each 
subtest (e.g., Forms 8A, 8B, 9A, 9B, 10A, 10B); only the first five 
forms for WK were available for analysis and comparison. Only three 
distinct forms of each subtest existed for the last four subtests 
listed. 




Mean proportions correct were higher on the new forms than on 
the experimental form. Values for each of the subtests clustered re- 
latively close to the median values, however. The standard devia- 
tions were approximately equivalent across forms, again clustering 
near their medians. Comparing median skews, the proportions correct 
appeared to be nearly symmetric in both data sets. A relatively wide 
range of individual values was observed, however. Kurtosis was quite 
constant both within and across data sets; all proportion-correct 
distributions were quite platykurtic. 

Biserial item-total correlations had relatively consistent means 
and standard deviations. There was some variation in skew within data 
sets. In the experimental form, values of skew ranged from -.872 to 
.012. In the new forms, the subtest skew ranged from -.432 to .089. 
Both medians were negative and not very different from each other. 
Kurtosis showed a wide range in the new forms, ranging from -1.009 
to .390. It was less variable in the experimental form, ranging from 
-.822 to .120. The medians for the two data sets were not substan- 
tially different. 

It was the IRT parameters, a, b, and c, that were most relevant 
to this project, however, as they were to form the basis for the sim- 
ulation models. Mean a parameters were consistent within and across 
data sets; median values were 1.608 and 1.719. Standard deviations 
were quite variable within each data set, and the medians were mark- 
edly different (.492 vs. .836). The skews were typically positive but 
again somewhat variable. There were wide differences in kurtosis 
within and across data sets, as observed for the biserial correlation 
coefficients. 

Table 2. Item Parameter Summary Statistics 
from Experimental Form 8 and New Forms 8, 9, 10 

                          Experimental Form 8             New Forms 8, 9, 10 (Pooled) 
             Test       Mean       SD     Skew Kurtosis        Mean       SD     Skew Kurtosis 

Prop. Corr.  WK         .555     .165     .369       --        .716     .150    -.338   -1.024 
             AR         .518     .141     .172    -.568        .656     .130     .121    -.766 
             MK         .602     .152    -.309       --        .619     .126     .111    -.664 
             EI         .598     .126    -.455    -.750        .640     .160    -.230    -.955 
             MC         .492     .165     .650    -.545        .625     .133    -.018    -.693 
             GS         .511     .132     .178    -.997        .660     .148    -.450   -1.004 
             Mdn        .536     .146     .175    -.830        .648     .140    -.124    -.860 

Bis.         WK         .700     .113    -.717       --        .670     .139    -.362    -.466 
             AR         .667     .071    -.080    -.470        .646     .105    -.432     .390 
             MK         .588     .124    -.744    -.608        .666     .084    -.089    -.862 
             EI         .694     .089    -.872    -.145        .508     .136    -.097   -1.009 
             MC         .625     .081     .012    -.822        .518     .110     .089    -.325 
             GS         .629     .090    -.019     .120        .565     .112    -.044    -.891 
             Mdn        .648     .090    -.398    -.308        .606     .111    -.093    -.664 

a            WK        1.769     .536    -.124    -.180       2.171     .996    -.214   -1.621 
             AR        1.816     .573     .789     .741       1.999     .904     .212   -1.498 
             MK        1.602     .449     .706     .500       2.146     .848     .058   -1.581 
             EI        1.486     .409     .444    -.190       1.183     .748    1.356    1.040 
             MC        1.613     .388    -.129    -.713       1.116     .584    1.8?4    4.075 
             GS        1.478     .627    1.019    1.433       1.439     .824    1.112     .012 
             Mdn       1.608     .492       --     .160       1.719     .836     .662    -.743 

b            WK        -.005     .686     .312    -.810       -.333     .707     .309    -.375 
             AR         .198     .772    -.484    -.572       -.126     .627    -.594    1.052 
             MK         .510     .976     .525    -.016        .019     .545    -.226    -.063 
             EI        -.014     .567     .098    -.886        .080     .908     .639    -.671 
             MC         .577     .859    -.495    -.633        .070     .788     .219     .128 
             GS         .413     .650     .456    -.027       -.079     .764     .825    -.376 
             Mdn        .306     .729     .205    -.602       -.030     .736     .304    -.219 

c            WK         .113     .067     .482    -.518 
             AR         .262     .114     .400     .635 
             MK         .293     .098     .754    -.411 
             EI         .170     .069     .626    -.467 
             MC         .287     .091     .938     .265 
             GS         .225     .113    -.368   -1.039 
             Mdn        .244     .094     .544    -.439 

Note: For the new Forms 8, 9, and 10, the c parameter was set to .25 
for all items. 

Part of the variability in the item statistics for the new ASVAB 
forms was undoubtedly due to difficulties with the item calibration 
procedure which caused a values to cluster at the upper limit. This 
clustering may be attributed to an artifact of the transformation 
procedure performed on the classical parameters from the new ASVAB 
forms. The theoretical relationship between the item-total biserial 
coefficients and the IRT a parameters is exponential, with high values 
for the former leading to very high values for the latter. At the 
upper end of the a distribution, then, the points are more spread out 
than they are at either the low end of the a distribution or the upper 
end of the distribution of biserials. (In this transformation proce- 
dure, the maximum a value was defined to be 3.20 and any transformed 
a which originally exceeded that value was set to 3.20. See Table 
3 for the numbers of items which reached this maximum value.) This 
phenomenon would produce a distribution of a parameters which had a 
larger mean and standard deviation, was more positively skewed, and 
was somewhat more platykurtic than might otherwise be found. This, 
of course, is exactly what was observed for the new ASVAB forms. 
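The clustering at the ceiling can be illustrated with a short sketch. It does not reproduce Urry's approximation, which also takes guessing into account and whose formula is not given here. Instead it assumes the familiar normal-ogive relation a = r / sqrt(1 - r^2) between a biserial correlation r and discrimination, with no guessing correction, and applies the 3.20 ceiling mentioned above. The function name and the biserial values are hypothetical.

```python
import math

A_MAX = 3.20  # ceiling on a reported in the text

def approx_a(biserial, a_max=A_MAX):
    """Assumed normal-ogive approximation a = r / sqrt(1 - r^2),
    clipped at a_max. Not Urry's guessing-corrected method."""
    a = biserial / math.sqrt(1.0 - biserial ** 2)
    return min(a, a_max)

# Hypothetical item-total biserials, evenly spread at the low end
# and tightly packed at the high end.
biserials = [0.50, 0.60, 0.70, 0.90, 0.95, 0.96, 0.98]

for r in biserials:
    print(f"r = {r:.2f}  ->  a = {approx_a(r):.3f}")
```

Small differences among high biserials become large differences in a, and every biserial above roughly .95 under this assumed relation lands exactly on the ceiling, which is the pile-up counted in Table 3.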

Table 3. Numbers and Percentages of Items From the New 
Forms 8, 9, 10 With a Parameters Set Equal to the Maximum Value 

            N in       N with       Percentage with 
Subtest     Subtest    Maximum a    Maximum a 

WK            175          72           41.14 
AR            180          50           27.78 
MK             75          21           28.00 
EI             60           4            6.67 
MC             75           3            4.00 
GS             75           9           12.00 

Total         640         159           24.84 

The item parameters for Experimental Form 8 were produced by 
the OGIVIA program which relies on the same transformation for the 
initial parameter estimates. There are two crucial differences 
between these parameters, however. The first is that the OGIVIA-pro- 
duced a parameters from Experimental Form 8 were restricted so that 
the maximum a during the first and second stages was 2.40. During 
the ancillary corrections, however, there was no bound on the a param- 
eters, and they were permitted to exceed 2.40 at this stage. The 
second difference between the two procedures lies in OGIVIA's refinements 
of the item parameters based on values of the c parameters. For Experi- 
mental Form 8, as will be discussed below, the c parameters were 
quite variable. Although this was probably also the case with the 
"true" c's in the new ASVAB forms, all these c's were set to .25. 
The effects of these restrictions and of the c parameters on the 
estimation of a are reflected in the observation that the OGIVIA- 
produced a parameters did not cluster at the upper end of the dis- 
tribution, and none were unreasonably large. Table 4 presents the 
numbers of items whose a parameters were equal to or exceeded 2.40 
after the ancillary corrections; these relatively small values should 
be contrasted with the numbers of items with a parameters set to the 
maximum (3.20) in Table 3. For Experimental Form 8, only two items 
had a parameters exceeding 3.20. 

The b-parameter means (Table 2) were slightly variable among 
subtests of the experimental form and quite constant in the new forms. 
Overall, the b parameters were slightly higher in the experimental 
form, indicating that either the items were more difficult or the 
AFEES examinees were less able than the high school students. Stan- 
dard deviations were variable within data sets, but their overall 
medians were essentially equivalent. Skews ranged from -.495 to .525 
in the experimental form and from -.594 to .825 in the new forms. 
Corresponding medians were .205 and .304. Kurtosis ranged from 
markedly flat to normal in the experimental form and from markedly 
flat to markedly peaked in the new forms; the kurtosis medians dif- 
fered somewhat. 

Table 4. Numbers and Percentages of Items From 
Experimental Form 8 With a Parameters Equal to or Exceeding 2.40 

            N in       N with       Percentage 
Subtest     Subtest    a >= 2.40    with a >= 2.40 

WK             30           4           13.33 
AR             20           3           15.00 
MK             20           1            5.00 
EI             30           0            0.00 
MC             20           1            5.00 
GS             19           2           10.53 

Total         139          11            7.91 

Note: One item from the original 20-item GS subtest was rejected by 
OGIVIA. Hence, IRT parameters were available for only 19 GS items. 



Moments of the c parameters were calculated only for the experi- 
mental form as all c values were set to .25 in the new forms. Means 
and standard deviations were relatively consistent about their 
medians of .244 and .094, respectively. Skew was typically positive, 
with one exception. Kurtosis was variable, ranging from quite flat 
to somewhat peaked. 

Table 5 presents intercorrelations among item parameters for 
Experimental Form 8 and new Forms 8, 9, and 10. For the new ASVAB 
forms where c was not estimated but, rather, set to .25, only the 
correlations between a and b could be calculated. The individual 
correlations exhibited considerable variation in all columns. The 
median of each column is presented at the bottom of Table 5. For 
Experimental Form 8, these medians were all essentially zero. For 
the new forms, the median a-b correlation was -.438. 

Specification of a representative item domain. It appeared
reasonable to assume that the item parameters summarized in Table 2
represented, with a few exceptions, a fair picture of the item
domains likely to be encountered in the world of military testing. To
form a basis for the simulations, a representative domain of items
had to be specified. As with most scientific problems, there was a
tradeoff between fidelity and practicality. The most faithful
procedure would run all simulations on item sets representing each of
the six subtests evaluated in Table 2. Practically, however, this
would limit the number of simulations that could be run on any one
item set. The approach taken in this project began by evaluating the
item parameter data presented above to determine how far the six sets
could reasonably be collapsed.

Table 5. Parameter Intercorrelations for
Experimental Form 8 and New Forms 8, 9, 10

             Experimental Form 8             New Forms 8, 9, 10
Subtest      a-b       a-c       b-c               a-b

WK           .254      .311      .718             -.659
AR          -.152      .154      .607             -.17?
MK           .027      .331      .233             -.037
EI           .300      .027      .315             -.625
MC          -.526      .011      [illegible]      -.349
GS          -.321      .026      .104             -.527

Median      -.063      .018      .064             -.438

Note: The c parameter was set to .25 for all items in the New Forms
8, 9, 10. Therefore, only the correlation between the a and b
parameters could be calculated.

The a parameters of the new forms were plagued by extreme
estimates in nearly one-fourth of the items (see Table 3). Comparison
of the first three tests with the last three tests hints at the extent
of this problem. The safest route appeared to be to disregard the a
parameters from the new forms and concentrate on those from the
experimental form. A single domain with a mean a of 1.6 and a standard
deviation of .49 seemed reasonable. Skew and kurtosis values
appeared to be nearly rectangularly distributed with few clusters. This
suggested either one or six separate distributions. Six distributions
seemed to be an extreme number to simulate just to capture differences
in skewness and kurtosis. Median values were thus used. For the
computer simulations, then, a was specified as having a mean of 1.60,
a standard deviation of .49, a skew of .58, and a kurtosis of .16.

Although the medians of most of the b parameter moments were
similar across the two forms, none of the distributions were
appropriate for an adaptive testing item pool. Since adaptive testing
is one of the major reasons for interest in IRT, the difficulty
distributions were extensively altered for simulation. An item pool
often considered ideal for adaptive testing has b parameters
rectangularly distributed between b = -3.0 and b = 3.0. Such a distribution
has a mean of 0.0, a standard deviation of 1.73, a skew of 0.0, and a
kurtosis of -1.2. It is not unreasonable to expect item writers to be
able to produce items similarly distributed. To allow for the
practical consideration that more weight will undoubtedly be given to the
center of the distribution, these specifications were relaxed somewhat.
Thus, the b distribution used for the simulation was specified to have
a mean of 0.0, a standard deviation of 1.5, a skew of 0.0, and a
kurtosis of -1.0.

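The moments quoted for the rectangular b distribution can be checked from the standard formulas for a uniform distribution. The endpoint symbols L and U are introduced here only for illustration, and kurtosis is taken as excess kurtosis (zero for a normal distribution), which is the convention the quoted value of -1.2 implies.

```latex
\begin{align*}
L &= -3.0, \qquad U = 3.0, \\
\mu &= \frac{L + U}{2} = 0.0, \\
\sigma &= \frac{U - L}{\sqrt{12}} = \frac{6}{\sqrt{12}} \approx 1.73, \\
\gamma_1 &= 0, \qquad \gamma_2 = -\frac{6}{5} = -1.2 .
\end{align*}
```
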
For input into the computer simulations, the c parameter
distribution was specified to be as it was for Experimental Form 8. The
parameters were: mean .24, standard deviation .09, skew .54, and
kurtosis -.44. Because the median inter-parameter correlations were
essentially zero for Experimental Form 8, uncorrelated parameter
distributions were used for the simulations.

Item parameters were generated from the specified mean, variance, 
skew, and kurtosis using the power method described by Fleishman 
(1978). This procedure allows random numbers to be generated with 
the first four moments asymptotically specified. 
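
The power method transforms a standard normal deviate Z into Y = a + bZ + cZ^2 + dZ^3, with the coefficients chosen so that Y has the target skew and kurtosis. The following is a minimal sketch of that idea, not the program used in this study: the Newton solver with a finite-difference Jacobian, the starting values, and every function name are assumptions made for illustration. Kurtosis is treated as excess kurtosis (zero for a normal distribution).

```python
import random


def fleishman_residuals(b, c, d, skew, kurt):
    """Fleishman's three moment equations; all are zero at a solution."""
    f1 = b * b + 6 * b * d + 2 * c * c + 15 * d * d - 1
    f2 = 2 * c * (b * b + 24 * b * d + 105 * d * d + 2) - skew
    f3 = 24 * (b * d + c * c * (1 + b * b + 28 * b * d)
               + d * d * (12 + 48 * b * d + 141 * c * c + 225 * d * d)) - kurt
    return [f1, f2, f3]


def det3(m):
    """Determinant of a 3x3 matrix given as a list of rows."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def fleishman_coefficients(skew, kurt, tol=1e-12, max_iter=100):
    """Solve for (a, b, c, d) by Newton's method (assumed solver)."""
    x = [1.0, 0.0, 0.0]  # b, c, d for the normal case
    h = 1e-6
    for _ in range(max_iter):
        f = fleishman_residuals(*x, skew, kurt)
        if max(abs(v) for v in f) < tol:
            break
        # Central-difference Jacobian: jac[i][j] = d f_i / d x_j
        jac = [[0.0] * 3 for _ in range(3)]
        for j in range(3):
            up, down = x[:], x[:]
            up[j] += h
            down[j] -= h
            f_up = fleishman_residuals(*up, skew, kurt)
            f_down = fleishman_residuals(*down, skew, kurt)
            for i in range(3):
                jac[i][j] = (f_up[i] - f_down[i]) / (2 * h)
        # Cramer's rule for jac * step = -f
        det = det3(jac)
        step = []
        for j in range(3):
            mj = [row[:] for row in jac]
            for i in range(3):
                mj[i][j] = -f[i]
            step.append(det3(mj) / det)
        x = [xi + si for xi, si in zip(x, step)]
    b, c, d = x
    return -c, b, c, d  # a = -c keeps the mean of Y at zero


def power_method_sample(n, mean, sd, skew, kurt, seed=None):
    """Draw n values with the requested first four moments (asymptotically)."""
    a, b, c, d = fleishman_coefficients(skew, kurt)
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        z = rng.gauss(0.0, 1.0)
        out.append(mean + sd * (a + b * z + c * z * z + d * z ** 3))
    return out


# Example: a parameters with the moments specified for the simulations.
a_params = power_method_sample(10, mean=1.60, sd=0.49, skew=0.58, kurt=0.16, seed=1)
```

Not every skew and kurtosis pair is attainable with a cubic polynomial, so a production routine would also need to check that the solver converged.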

Item parameters specified as described above did not always pro- 
duce acceptable items. A few items were so extreme in difficulty 
that either all simulated examinees responded correctly or all
responded incorrectly. When this happened, it was not possible to
estimate parameter values for the item and it had to be discarded at the
calibration phase. To prevent this from happening, items were
rejected at an earlier phase, when they were first generated, if the
expected proportion correct in a standard normal population was below
.03 or above .97. This expected proportion correct was obtained
from Equation 2 (from Owen, 1969, Eq. 6.2).



$$P = c + .5\,(1 - c)\,[1 - \operatorname{erf}(D)] \qquad [2]$$

where $D = b\,[2(a^{-2} + 1)]^{-1/2}$

and $\operatorname{erf}(x) = 2\pi^{-1/2} \int_0^x \exp(-t^2)\,dt$

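The screening rule built on Equation 2 can be sketched in a few lines. This is an illustrative sketch only; the function names and the example item values are hypothetical, and `math.erf` from the Python standard library stands in for the error function defined above.

```python
import math


def expected_proportion_correct(a, b, c):
    """Equation 2: expected proportion correct in a standard normal population."""
    d = b / math.sqrt(2.0 * (a ** -2 + 1.0))
    return c + 0.5 * (1.0 - c) * (1.0 - math.erf(d))


def is_acceptable(a, b, c, low=0.03, high=0.97):
    """Keep an item only if its expected proportion correct is within bounds."""
    p = expected_proportion_correct(a, b, c)
    return low <= p <= high


# Hypothetical items: a middle-difficulty item and a very easy one.
print(expected_proportion_correct(1.6, 0.0, 0.25))  # 0.625
print(is_acceptable(1.6, 0.0, 0.25))                # True
print(is_acceptable(1.6, -4.0, 0.25))               # False (too easy)
```

Because the expected proportion correct can never fall below c, items with a guessing parameter above .03 can only be rejected for being too easy, which is consistent with the rise in the mean of b reported after rejection.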

Rejection of items in this manner was expected to affect the 
distributions of the item parameters such that the moments would not 
be exactly as specified in the preceding paragraph. Since moments of 
the true parameters were needed for evaluation of some of the linking 
methods, a simulation was run to estimate these moments. In this
simulation, 10,000 acceptable items were generated using the
procedure described above. The first four moments were calculated for the
three item-parameter distributions. For the a parameters, the mean,
standard deviation, skew, and kurtosis, respectively, were 1.585,
.488, .602, and .220; for the b parameters they were .227, 1.337,
.079, and -.995; for the c parameters they were .240, .090, .527,
and -.449. The only noticeable changes resulting from this rejection
were in the b parameters; the mean rose slightly and the standard
deviation and skew dropped slightly.

Specification of Ability Distributions

The objectives of the analysis of the AFEES ability distributions
were threefold. The first was to obtain parameters of ability
distributions for use in simulation models. Since one link between
simulation and the real world is the ability distribution which generates
the response vectors, the parameters describing this distribution
should, as closely as possible, reflect the current AFEES examinee
population. The second objective was to determine whether the AFEES
examinees were sufficiently variable in mean ability to make item
calibration more efficient by non-random assignment of experimental items.
The final objective was to determine if the AFEES examinees were
sufficiently similar that the equivalent-groups method could be
effectively applied using the AFEES as the experimental sampling unit, even
though that would violate a basic assumption of the method.

Examinee data available. The primary data available for analysis
consisted of number-correct scores of 500 applicants from each of
the 65 Continental United States (CONUS) AFEES on 12 subtests of
ASVAB-7, randomly selected from tests administered during calendar year
1979. Six of the ASVAB-7 subtests were deleted from the analysis
either because they were speeded tests or because they had been
eliminated in the newer versions of the ASVAB. Fifty-six cases, in which
keypunch errors were encountered, were deleted from the 32,500 cases
available for analysis, leaving a total of 32,444 cases for further
analysis. These deletions were essentially random and no single AFEES
lost more than three cases to such errors.

Additionally, data from a sample of 500 applicants tested on an
experimental version of ASVAB-8 were available in summary form. These
data consisted of grouped frequency distributions of modal Bayesian
latent trait estimates from the item calibration program, OGIVIA.
They were collected during calendar year 1978.

Score data available. Ideally, latent trait estimates of ability
should be used to evaluate the distributional characteristics of
the underlying trait. The individual item response vectors needed to
compute latent trait ability estimates were not available for
analysis, however. The raw number-correct scores that comprised the
primary data set were less than optimal for evaluation of ability
distributions for several reasons. One major problem with using
number-correct scores is that different response patterns can result in the
same number-correct score. When test items differ in their
characteristic functions, differing response patterns to a set of items,
each containing the same number of correct responses, can result in
differing ability estimates. The effect of this is that the shape
of the distribution of number-correct scores may differ from that of
the underlying ability.

If IRT item parameters are available for a set of items, the
test characteristic curve can be computed. This curve relates
ability levels to true scores and can be used to approximate ability
levels from number-correct scores. The item parameters were not
available for ASVAB-7, however, and this transformation was not
possible. The ability distributions were thus developed by simply
standardizing the number-correct scores. The shape of the
distribution of standardized scores would be correct if the test
characteristic curve was linear. The degree to which this was true in the
available data was not readily assessable.
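
To make the role of the test characteristic curve concrete, the sketch below computes it for a small set of items and inverts it by bisection to recover an ability level from a true score. It assumes the three-parameter logistic model with D = 1.7; the item parameters and function names are hypothetical and are not ASVAB values.

```python
import math

D = 1.7  # logistic scaling constant


def p_correct(theta, a, b, c):
    """Three-parameter logistic probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def tcc(theta, items):
    """Test characteristic curve: expected number-correct score at theta."""
    return sum(p_correct(theta, a, b, c) for a, b, c in items)


def theta_for_true_score(score, items, lo=-6.0, hi=6.0, tol=1e-9):
    """Invert the curve by bisection (it is increasing in theta)."""
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if tcc(mid, items) < score:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


# Hypothetical (a, b, c) triples for a three-item test.
items = [(1.6, -1.0, 0.25), (1.2, 0.0, 0.20), (2.0, 1.0, 0.25)]
true_score = tcc(0.7, items)
recovered = theta_for_true_score(true_score, items)
```

Where the curve is close to a straight line over the range of interest, standardized number-correct scores and ability levels have nearly the same distributional shape, which is the condition described in the text.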

The limited set of data available from the experimental form of
ASVAB-8 did, however, provide an avenue for verification that the
distribution shapes were reasonable. Although these data were not
sufficient to draw any conclusions regarding differences among AFEES,
they were adequate for evaluating the representativeness of the third
and fourth moments.

Raw-score analysis. The parameters of the ability distributions
for each subtest were estimated from the first four central moments
of the total AFEES sample. The means and variances were set to zero
and one, respectively, to facilitate subsequent analyses. Table 6
presents the skew and kurtosis for each ASVAB-7 subtest. With the
exception of Word Knowledge and Electronics Information scores, which
had slight negative skews, the remaining subtest scores had slight
positive skews. Almost all subtest scores exhibited marked
platykurtosis.



Table 6. Overall Skew and Kurtosis
ASVAB-7 Number-Correct Scores (N=32,444)

Subtest      Skew             Kurtosis

WK          -.[illegible]     -.991
AR           .162             -.850
MK           .328             -.717
EI          -.213             -.247
MC           .383             -.429
GS           .259             -.560

Median       .210             -.634



Because of the extreme flatness of the observed-score distri- 
butions, a check was made to ascertain whether this was due to out- 
liers or whether it represented the true shape of the distribution. 
The raw-score frequency distributions of a random sample of
approximately [illegible] of the total AFEES sample for each ASVAB subtest
are presented in Figures 3 to 8. It is apparent from the figures that the
observed flatness was not an artifact caused by a clustering of scores
at the endpoints. Thus the platykurtosis of the ability distributions
is a realistic representation of the actual shape of the distribution.
An earlier study by Fruchter and Ree (1977) describing the psychometric
characteristics of experimental ASVAB Forms 8, 9, and 10 compared to
operational Form 7B presented descriptive statistics from a sample of
AFEES examinees similar to the present sample. Their results
indicated the same trend toward platykurtosis as was found in this project.

Differences among AFEES. Two of the objectives of the AFEES
evaluation centered on the determination of the differences in
ability distributions among AFEES. Raw scores for all subtests were
standardized by a linear transformation to a mean of zero and a stand- 
ard deviation of one, as discussed above, to approximate the metric of 
a standard ability continuum. This standardization was done across 
all 32,444 examinees. The first four moments of these standard scores 
were then computed within each of the 65 AFEES groups. 

Figure 4. Raw Score Frequency Distribution: Arithmetic Reasoning
Figure 5. Raw Score Frequency Distribution
Figure 6. Raw Score Frequency Distribution
Figure 8. Raw Score Frequency Distribution: General Science

Table 7 presents summary statistics on the AFEES for each ASVAB
subtest. The columns are the four central moments computed across
AFEES (i.e., mean, standard deviation, skew, and kurtosis). The rows
represent the ASVAB subtests and, within each subtest, the mean,
standard deviation, minimum, and maximum of the first four moments.
The mean of the means was zero in all cases since the computation was
done on standard scores. The mean of the standard deviations was
somewhat less than one. This is because part of the overall variance
is due to variance among subgroup means, which is not included in this
calculation.
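
The reason the within-group standard deviations average less than one can be seen in a small sketch. Scores are standardized over the pooled sample, and the unit variance then splits into a between-group part and a within-group part. The two groups and their scores below are hypothetical (not AFEES data), the groups are assumed to be of equal size, and the function names are illustrative.

```python
import math


def mean(xs):
    return sum(xs) / len(xs)


def biased_sd(xs):
    """Standard deviation with divisor n (the biased formula)."""
    m = mean(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / len(xs))


def standardize_overall(groups):
    """Standardize every score using the mean and SD of the pooled sample."""
    pooled = [x for scores in groups.values() for x in scores]
    m, s = mean(pooled), biased_sd(pooled)
    return {name: [(x - m) / s for x in scores] for name, scores in groups.items()}


# Hypothetical number-correct scores for two equally sized groups.
groups = {"A": [0.0, 2.0], "B": [2.0, 4.0]}
z = standardize_overall(groups)

group_means = [mean(v) for v in z.values()]
group_sds = [biased_sd(v) for v in z.values()]

between = mean([m ** 2 for m in group_means])  # variance of group means (grand mean is 0)
within = mean([s ** 2 for s in group_sds])     # average within-group variance

print(between + within)  # total variance of the standardized scores
print(mean(group_sds))   # average within-group SD, below one
```

With group means whose standard deviation is near .22, as reported for the AFEES, the between-group share is about .05, leaving an average within-group variance of about .95 and a typical within-group standard deviation of roughly .97.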

The standard deviations of the AFEES means and standard
deviations are of interest in that they provide information regarding
the error that will be introduced into the linked b and a parameters,
respectively, if differences among the AFEES are not controlled in
the linking process. If, for example, the equivalent-groups method
was used and sampling was done non-randomly by assigning a different
booklet to each AFEES, these standard deviations are related to the
root-mean-square (RMS) parameter error that would be introduced into
the item parameters (the square of these values would be added to the
mean-square error). The standard deviations of the AFEES means
ranged from .201 to .244, which indicated that the AFEES were
relatively homogeneous with respect to deviations about their central
values. The mean-square error expected to be added to the linking
error on the b parameters when sampling by AFEES was thus on the
order of .040 to .060. Likewise, the range of linking error expected
to be added to the a parameters was on the order of .001 to .003
(squared standard deviations of the AFEES standard deviations).
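
The arithmetic behind the quoted range for the b parameters is a direct squaring of the endpoints. In the sketch below, MSE_link is a label introduced here for the linking error that would be present without any AFEES differences, and sigma_M denotes the standard deviation of the AFEES means.

```latex
\begin{align*}
\mathrm{MSE}_{\text{total}} &= \mathrm{MSE}_{\text{link}} + \sigma_M^2, \\
\sigma_M = .201 &\;\Rightarrow\; \sigma_M^2 = .0404 \approx .040, \\
\sigma_M = .244 &\;\Rightarrow\; \sigma_M^2 = .0595 \approx .060 .
\end{align*}
```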



Table 7. Standard-Score Summary Statistics
Across AFEES for ASVAB-7 Subtests

                           AFEES Moments by Subtests
Subtest   Statistic    Mean           SD        Skew      Kurtosis

WK        Mean         .000          .971      -.100      -.878
          SD           .235          .011       .215       .158
          Min         -.634          .876      -.512     -1.119
          Max          .385         1.060       .557      -.108

AR        Mean         .000          .975       .162      -.739
          SD           .222          .037       .219       .219
          Min         -.455          .852      -.350     -1.026
          Max          .428         1.056       .725       .157

MK        Mean         .000          .978       .321      -.620
          SD           .201          .019       .202       .306
          Min         -.340          .798      -.081     -1.078
          Max          .109         1.059       .718       .212

EI        Mean         .000          .972      -.188      -.193
          SD           .230          .019       .152       .253
          Min          [illegible]   .831      -.607      -.598
          Max          .381         1.056       .818      1.198

MC        Mean         .000          .959       .381      -.307
          SD           .211          .050       .196       .365
          Min         -.513          .791      -.073      -.833
          Max          .115         1.091       .820       .911

GS        Mean         .000          .971       .268      -.180
          SD           .225          .033       .167       .215
          Min         -.113          .882      -.097      -.867
          Max          .382         1.031       .680      -.169



Comparisons of the overall skew and kurtosis given in Table 6 
for each subtest with the skew and kurtosis for AFEES by subtest in 
Table 7 revealed virtually the same magnitudes and directions for the 
respective subtests. This indicated that the distributions of scores 
within AFEES were very similar in shape to the distributions over all 
AFEES. Thus the four central moments computed for each subtest
appeared to be reasonable estimates of the unknown true population
values.

Modal Bayesian trait estimates. A parallel analysis was
conducted on the available grouped frequency data provided by the IRT
calibration program by computing the first four central moments for each
ASVAB subtest. The formulas used to compute the moments were simply
generalized versions of the formulas for ungrouped data where each
element in the sum was the midpoint of its class interval weighted by
the frequency of its occurrence.
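
The grouped-data formulas described above can be sketched as follows. The function name and the example class intervals are hypothetical; the divisor is the total frequency (the biased formula), and kurtosis is reported as excess kurtosis so that a normal distribution scores zero.

```python
import math


def grouped_moments(midpoints, freqs):
    """Mean, SD, skew, and excess kurtosis from class midpoints and frequencies."""
    n = sum(freqs)
    mean = sum(f * x for x, f in zip(midpoints, freqs)) / n
    # Second, third, and fourth central moments, each term weighted by frequency.
    m2, m3, m4 = (
        sum(f * (x - mean) ** k for x, f in zip(midpoints, freqs)) / n
        for k in (2, 3, 4)
    )
    sd = math.sqrt(m2)
    return mean, sd, m3 / sd ** 3, m4 / m2 ** 2 - 3.0


# Hypothetical flat distribution over five class intervals.
mean, sd, skew, kurt = grouped_moments([0.0, 1.0, 2.0, 3.0, 4.0], [2, 2, 2, 2, 2])
print(mean, sd, skew, kurt)
```

Setting every frequency to one reduces the function to the ordinary ungrouped formulas.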

As with the number-correct scores, the grouped modal Bayesian
estimates exhibited consistent platykurtosis which ranged from -.607
for Arithmetic Reasoning to -.860 for Word Knowledge (see Table 8).
Similarly, a slight skew was observed. Comparison of Table 8, which
shows the four central moments for the ASVAB-8 modal Bayesian
estimates, with Table 6 for the ASVAB-7 number-correct scores, indicates
that the skews observed for the modal Bayesian estimates were similar
to those of the number-correct scores observed over all AFEES.
Agreement between data sets on observed kurtosis was also apparent. Both
data sets agreed in direction and magnitude of the observed kurtosis.



Table 8. Mean, Standard Deviation, Skew, and Kurtosis of
ASVAB-8 Modal Bayesian Ability Estimates (N=500)

Subtest     Mean      SD        Skew      Kurtosis

WK          .086      .85?      .177      -.860
AR          .094      .805      .164      -.607
MK          .110      .736      .195      -.643
EI          .078      .807      .026      -.623
MC          .087      .785      .115      -.782
GS          .137      .729      .280      -.702



Overall, analysis of the modal Bayesian ability estimates tended 
to confirm the results of the number-correct score data and support 
the observation of flat ability distributions on ASVAB subtests.
Although restricted to a fairly small sample (N=500) compared to the
number-correct data, the modal Bayesian estimates were the preferred
type of data. The results from these two rather disparate data sets
tended to reveal the same general trends; therefore, the actual shapes
of the underlying trait dimensions appeared to be adequately
represented.

Specification of distributional parameters. To form a basis for
the simulations, the ability data summarized in the preceding
sections had to result in specification of a set of parameters to define
the simulation models. To accommodate the simulations to be
performed, two sets of ability parameters were needed. The first set
required ability parameters for the overall AFEES distribution and the
second set required ability parameters to describe each individual
AFEES.

The data summarized consisted of six ASVAB subtests, representa- 
tive of ability tests used by the Armed Services. To specify the 
parameters for the simulations, the first question to be answered was 
whether a single set of parameters could represent all of the tests
or whether several sets would have to be included in the simulations.
To answer this question, the skews and kurtoses of the overall
distributions were of primary interest as the means and standard deviations
were to be set to zero and one. Tables 6 and 8 allow comparisons
between the skews and kurtoses of the ability distributions on the
six subtests. Although many of the differences between subtests were
statistically significant due to the large sample sizes, the absolute
magnitude of the differences was relatively small. A general
statement could be made that the ability distributions were, in most
cases, symmetric and flat. The decision was thus made that a single 
subtest's ability distribution could be taken as representative of 
Armed Services ability tests. 

The question remaining was how to choose the most representative 
test. Of two possible solutions, one was to use median values for 
the distributional parameters across the six subtests, while the other 
was to select a single test as representative and use its parameters 
throughout. It is possible, under the first approach, to get im- 
possible combinations of parameters. Also, across AFEES, the param- 
eters thus defined would have less variability than a typical set of 
parameters. A single test was thus chosen as representative of the 
ASVAB subtests. 

To choose that subtest, the subtests were rank ordered according 
to their absolute deviations from the median of the overall skew and 
kurtosis values shown in Table 6. General Science and Arithmetic
Reasoning ranked closest to the median for skew. General Science and 
Math Knowledge ranked closest to the median for kurtosis. 
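
The ranking step can be sketched as below for the kurtosis column. The values are those of Table 6 as read from the scanned table (the Electronics Information entry in particular is a reading of a damaged digit), and the function name is hypothetical.

```python
from statistics import median


def rank_by_typicality(values):
    """Order subtests by absolute deviation from the median, closest first."""
    med = median(values.values())
    return sorted(values, key=lambda name: abs(values[name] - med))


# Overall kurtosis by subtest, as read from Table 6.
kurtosis = {
    "WK": -0.991, "AR": -0.850, "MK": -0.717,
    "EI": -0.247, "MC": -0.429, "GS": -0.560,
}

ranking = rank_by_typicality(kurtosis)
print(ranking[:2])  # the two subtests closest to the median
```

With six subtests the median falls midway between the third and fourth ordered values, so the two subtests that straddle it are equally close and both rank as most typical.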

Across AFEES, it was essential that the test chosen as repre- 
sentative have representative variability in mean and standard devia- 
tion of the individual AFEES groups. The six subtests were thus 
rank-ordered on the standard deviation of their means across AFEES 
and the standard deviation of their standard deviations across AFEES.
From the data in Table 7, it was determined that the typical tests in
terms of variability of means were Electronics Information and
General Science. In terms of standard deviations, the most typical were
Math Knowledge and Word Knowledge.

General Science was among the most typical subtests in three of the
four comparisons, more than any other subtest. Its parameters were
thus selected for the simulation model. The overall ability parameters
were thus a mean of zero, a standard deviation of one, a skew of .259,
and a kurtosis of -.560. The four parameters from each of the 65 AFEES
on the General Science test were used for individual AFEES simulations.
These are listed in Appendix Table A-1.

Basic Data Sets 

Four basic item linking paradigms were to be evaluated. It be- 
came apparent from review of the Armed Services calibration environ- 
ment that practical administration constraints might, in a predict- 
able fashion, violate a basic assumption of at least one of the para- 
digms. Specifically, the assignment of experimental test booklets to 
AFEES examinees would possibly be done non-randomly. In the limiting 
case, it is possible that each AFEES might receive a single form of a 
test booklet and, further, might be the only group to receive that
booklet. Thus, two distribution schemes were simulated, the ideal 
case reflecting random distribution of test booklets and the worst 
case expected, that of non-random distribution. 

The additional possibility existed that items might be calibrated 
on a selected group of examinees, such as those already in the Armed
Services. A basic data set reflecting this situation was thus also
developed.

Randomly sampled examinees. For the random-distribution case, a
two-way grid composed of 12 combinations of test lengths of 20, 35,
50, and 65 items with examinee group sizes of 500, 1,000, and 2,000
formed the framework of the design. Within each cell, the specified
number of examinees was randomly drawn from a standard ability
population with a skew of .259 and a kurtosis of -.560. A sample of items
was then drawn with parameters following the domain distribution
specified in an earlier section. This process was repeated five times
in each cell, with new random samples of examinees and items each
time.

Systematically sampled examinees. The non-random procedure was
similar to the random procedure except that for each replication, one
of the 65 AFEES was randomly selected (with replacement) and its
distributional statistics on the General Science test were used to
describe the population from which examinees were drawn. In a real
calibration design, the non-randomness of the sampling procedure
would probably be less extreme. Each test booklet would probably be
distributed over several AFEES groups. The exact distribution plan
could not be predicted, however, and the limiting case was chosen to
provide a bound to the errors that could be expected.

Selected examinees. One row of the basic matrix corresponding
to 1,000 examinees was simulated at the standard test lengths of 20,
35, 50, and 65 items for the selected examinee condition. As with 
the other conditions, five replications were done in each cell. In 
this condition, however, 1,500 examinees were generated and sorted on 
the basis of the number-correct score. One thousand individuals 
with scores at or above the score of the individual ranked 1,000th 
were selected. This procedure was done to simulate examinees se- 
lected on the basis of a cutting score and the cutting score was 
chosen to be similar to that used by Ree (1979). 

Composite sets of items. To evaluate the effects of linking
procedures, items from more than one calibration must be combined and
linked. To facilitate this evaluation, two types of composites were
assembled from the basic data sets. In the homogeneous condition,
the five sets in each cell of each 3x4 or 1x4 matrix were linked
together. In cells containing 20-item sets, 100 items were linked
together; in cells containing 65-item sets, 325 items were linked
together. Composite sets so assembled provided data regarding linking
adequacy when all sets included were homogeneous with regard to test
length and size of calibration group.

The second type of composite, the heterogeneous condition, was
formed by selecting 20 items from one set of each of the 12 cells of
the 3x4 matrix to form a set of 240 items. Items beyond the first 20
in a set were ignored. This procedure resulted in five composites
from each matrix, one corresponding to each replication within the
cells. This type of composite yielded data regarding linking
adequacy when sets included were heterogeneous with respect to test
length and calibration group size.
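
The bookkeeping for the design described above can be tallied with a short sketch. The condition labels, constant names, and tuple layout are assumptions made for illustration; the counts follow from the test lengths, group sizes, and replications stated in the text.

```python
from itertools import product

TEST_LENGTHS = (20, 35, 50, 65)
GROUP_SIZES = (500, 1000, 2000)
REPLICATIONS = 5


def administrations():
    """Enumerate every simulated administration as (condition, items, examinees, rep)."""
    runs = []
    # Random and systematic conditions each use the full 3x4 grid.
    for condition in ("random", "systematic"):
        for n_items, n_examinees in product(TEST_LENGTHS, GROUP_SIZES):
            for rep in range(REPLICATIONS):
                runs.append((condition, n_items, n_examinees, rep))
    # The selected-examinee condition uses only the 1,000-examinee row.
    for n_items in TEST_LENGTHS:
        for rep in range(REPLICATIONS):
            runs.append(("selected", n_items, 1000, rep))
    return runs


runs = administrations()
print(len(runs))  # 60 + 60 + 20 administrations

# Homogeneous composites: the five replications in one cell are pooled.
homogeneous_sizes = {n: n * REPLICATIONS for n in TEST_LENGTHS}

# Heterogeneous composites: the first 20 items from one set in each of the 12 cells.
heterogeneous_size = 20 * len(TEST_LENGTHS) * len(GROUP_SIZES)
```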

Calibration of items. For each of the 140 administrations
enumerated above, item responses were generated using true ability levels
and true parameters according to the following algorithm:

1. The probability of a correct response to an
item, given an individual's ability and the
true item parameters, was calculated using
Equation 1.

2. A random number from a rectangular distribution
on the range from zero to one was drawn.

3. A response of "correct" was assigned if the
probability exceeded the random number.
Otherwise, a response of "incorrect" was
assigned. (See Ree, 1980b, for a more detailed
description of this type of procedure.)
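
The three steps above can be sketched as follows. The sketch assumes that Equation 1 is the three-parameter logistic model with D = 1.7, as used elsewhere in this report; the function names, the seed, and the example parameter values are hypothetical.

```python
import math
import random

D = 1.7  # logistic scaling constant


def p_correct(theta, a, b, c):
    """Step 1: probability of a correct response (three-parameter logistic, assumed)."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def simulate_response(theta, a, b, c, rng):
    """Steps 2 and 3: draw a uniform number and compare it with the probability."""
    p = p_correct(theta, a, b, c)
    u = rng.random()           # rectangular on [0, 1)
    return 1 if p > u else 0   # 1 = correct, 0 = incorrect


def simulate_vector(theta, items, rng):
    """Response vector of one examinee to a list of (a, b, c) items."""
    return [simulate_response(theta, a, b, c, rng) for a, b, c in items]


rng = random.Random(1980)  # arbitrary seed for reproducibility
items = [(1.6, -1.0, 0.25), (1.2, 0.0, 0.20), (2.0, 1.0, 0.25)]  # hypothetical
responses = simulate_vector(0.5, items, rng)
```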

The item response data thus created were used as input to the item 
calibration program OGIVIA. This program provided item parameter 
estimates and modal Bayesian ability estimates (using a standard nor- 
mal prior ability distribution). 

For each of the administrations, the following statistics were
recorded:

1. The first four moments of the population ability 
distribution. 

2. The true parameters for each of the items. 

3. The estimated parameters for each of the items. 

4. The true ability level for each examinee. 

5. The estimated ability level for each examinee. 

6. The response of each examinee to each item. 

These data formed the basic data sets used for analyses of the four 
basic linking methods. How the same data were used for the four dif- 
ferent linking methods is described below. 



Evaluative Criteria 

Three categories of evaluative criteria were used to evaluate 
the adequacy of calibration and linking. The first category included 
the usual fidelity-of-estimation criteria used in previous studies.
They were used in this study to provide simple indices of estimation 
accuracy and to provide a means of comparing the results of this study 
with those of previous studies. 

A study of calibration and linking must consider that, ultimately, 
the interest will be in the effects of different techniques on the esti- 
mation of ability. Fidelity-of-estimation criteria do not afford any 
direct inference regarding accuracy of ability estimates. To
ameliorate this problem, the last two categories of criteria evaluate the
asymptotic (i.e., infinite test length) characteristics of ability 
estimates and the efficiencies with which various techniques approach 
these characteristics. 

Fidelity of Parameter Estimation

Bias. Perhaps the most basic of the fidelity criteria is bias
in the distributions of the item parameters. To assess the bias in
the distributions of the parameters, means and standard deviations of
the true and estimated parameters were calculated for all conditions 
of interest. The biased formula for the standard deviation was used, 
as it was throughout this research. 

Absolute error. The mean absolute difference between true and
estimated parameters was calculated and is referred to throughout 
this report as the absolute error. Algebraic error or bias may can- 
cel out even though severe errors of estimation exist. Absolute 
error is one method used to eliminate this cancelling effect. 

Root-mean-square error. Root-mean-square error is an index
similar to absolute error except it is computed by taking the square 
root of the mean of the squared differences between true and esti- 
mated parameters. The primary difference in effect is that the root- 
mean-square index weights the extreme deviations more heavily than 
does the absolute index. Root-mean-square error was calculated for 
all conditions of interest. 

Correlations. Correlations between true and estimated item
parameters were calculated. The simple Pearson product-moment corre- 
lation was used. This index can be thought of as a complement to 
indices of algebraic bias. The bias indices are sensitive to changes 
in the location of the distribution of parameters. The correlation is 
sensitive to differences in relative position between corresponding 
true and estimated parameters. 
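
The four fidelity criteria can be computed together, as in the sketch below. The function and key names are hypothetical, and the example values are invented to show how algebraic bias can cancel while absolute and root-mean-square error do not. The biased (divisor n) formula is used for standard deviations.

```python
import math


def mean(xs):
    return sum(xs) / len(xs)


def biased_sd(xs):
    m = mean(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / len(xs))


def fidelity_indices(true, est):
    """Bias summaries, absolute error, RMS error, and Pearson correlation."""
    n = len(true)
    diffs = [e - t for t, e in zip(true, est)]
    mt, me = mean(true), mean(est)
    st, se = biased_sd(true), biased_sd(est)
    cov = sum((t - mt) * (e - me) for t, e in zip(true, est)) / n
    return {
        "mean_true": mt, "mean_est": me,
        "sd_true": st, "sd_est": se,
        "absolute_error": mean([abs(d) for d in diffs]),
        "rms_error": math.sqrt(mean([d * d for d in diffs])),
        "correlation": cov / (st * se),
    }


# Invented b parameters: the estimates are too spread out but not shifted.
result = fidelity_indices([1.0, 2.0, 3.0], [0.0, 2.0, 4.0])
print(result)
```

In this example the means agree and the correlation is perfect, yet the absolute and root-mean-square errors are well above zero, which is why the indices are reported together.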

Characteristics of Asymptotic Ability Estimates 

Most of the desired knowledge that pertains to the ability to 
estimate a trait can be indexed by the bias and the precision with 
which the trait is estimated. In an effort to evaluate the bias due 
to calibration it is helpful to think of two trait metrics for the 
given trait of interest. The theta (θ) metric can be defined as the
absolute or criterion metric on which the true parameters are anchored
and along which the response probabilities are accurately described by
the model incorporating the theta level and the item parameters. A
second metric, gamma (Γ), can be described as a one-to-one
transformation of the theta metric produced by scoring item responses using
item parameters other than those true parameters of the theta metric.
The gamma level corresponding to a given theta level could be deter- 
mined, conceptually, from administering a test scored using the errant 
parameters an infinite number of times. Each theta value would thus 
asymptotically converge on a single gamma value. The difference be- 
tween gamma and theta at any value of theta could be defined as the 
bias due to use of the errant parameters. 
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Practically, it is impossible to administer infinite-length tests 
or to repeat a finite-length test an infinite number of times. The 
theta-gamma transformation can be determined by more practical means, 
however. The maximum likelihood estimate of theta, which is asymp- 
totically unbiased, can be obtained by finding the root in theta of 
the following equation given by Birnbaum (1968, p. *459): 



Zv CD \ (e -V 3 - IrV*- =° [3] 
g=i g=i 

where: D' = 1.7 

w [9] = Da ¥[<Da (9-b ) - ln(e )] [4] 
g g g g g 

and u = 1 for a correct response to item £ and 0 other- 

8 wise. 



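If each item were repeated r times, Equation 3 could be written as:

$$\sum_{g=1}^{m}\sum_{h=1}^{r} w_g(\theta)\,u_{gh} - \sum_{g=1}^{m}\sum_{h=1}^{r} w_g(\theta)\,P_g(\theta) = 0 \tag{5}$$

or

$$\sum_{g=1}^{m} w_g(\theta)\sum_{h=1}^{r} u_{gh} - r\sum_{g=1}^{m} w_g(\theta)\,P_g(\theta) = 0 \tag{6}$$

or

$$\sum_{g=1}^{m} w_g(\theta)\,P_g - \sum_{g=1}^{m} w_g(\theta)\,P_g(\theta) = 0 \tag{7}$$

where $P_g$ = the observed proportion of correct responses to
item $g$ in $r$ repetitions.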

If the number of repetitions were allowed to become infinite and the
three-parameter logistic model holds,

$$P_g = P_g(\theta) = c_g + (1 - c_g)\,\Psi\left[D a_g(\theta - b_g)\right] \tag{8}$$

Computing $P_g$ as above, the root of the likelihood equation is found
at $\hat{\theta} = \theta$. If, however, $P_g$ is calculated using
$\theta$ and the errant item parameters $\hat{a}_g$, $\hat{b}_g$, and
$\hat{c}_g$, the root of Equation 7 is found at $\hat{\theta} = \Gamma$.
If the errors of calibration are zero or the estimated parameters are
consistent with the true parameters, the transformation of theta to
gamma will be linear. When this is not the case, as in almost all
real calibration situations, the transformation will be non-linear.
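
The limiting argument can be illustrated numerically. The following
Python sketch is one reading of the procedure, not the report's own
program. It assumes that the asymptotic observed proportions come from
the true parameters (Equation 8), that the weights (Equation 4) and the
model probabilities in Equation 7 are evaluated with the errant
parameters, and that the root is located by a simple bisection search
on the interval from -5 to 5. All function names and the item values in
the usage note are hypothetical.

```python
import math

D = 1.7  # scaling constant used in the logistic model


def logistic(t):
    """Logistic function Psi(t)."""
    return 1.0 / (1.0 + math.exp(-t))


def p3pl(theta, a, b, c):
    """Three-parameter logistic probability of a correct response."""
    return c + (1.0 - c) * logistic(D * a * (theta - b))


def weight(theta, a, b, c):
    """Locally best weight for one item; requires c > 0 because of ln(c)."""
    return D * a * logistic(D * a * (theta - b) - math.log(c))


def scoring_equation(gamma, theta, true_items, errant_items):
    """Left side of the asymptotic likelihood equation.

    Observed proportions are the true probabilities at theta; weights and
    model probabilities use the errant parameters evaluated at gamma.
    """
    total = 0.0
    for (a, b, c), (ea, eb, ec) in zip(true_items, errant_items):
        observed = p3pl(theta, a, b, c)
        total += weight(gamma, ea, eb, ec) * (observed - p3pl(gamma, ea, eb, ec))
    return total


def gamma_for_theta(theta, true_items, errant_items, lo=-5.0, hi=5.0, tol=1e-10):
    """Bisection search for gamma, clamped to the interval [lo, hi]."""
    if scoring_equation(lo, theta, true_items, errant_items) <= 0.0:
        return lo  # root lies at or below the lower bound
    if scoring_equation(hi, theta, true_items, errant_items) >= 0.0:
        return hi  # root lies at or above the upper bound
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if scoring_equation(mid, theta, true_items, errant_items) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

With items given as (a, b, c) tuples, passing the same list as both the
true and the errant parameters should return gamma equal to theta. If
the errant parameters are a consistent rescaling of the true ones (each
a divided by k, each b replaced by k b + m, c unchanged), the sketch
should return k theta + m, which is the linear case described above.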

The function transforming theta to gamma completely describes
the asymptotic effect of item parameter error on ability estimation.
This empirical function has no simple descriptive parameters,
however, and a method to condense many functions into table values was
needed for this research. To accomplish this, a standard normal
density function was taken as a reference theta population and
descriptive parameters of the corresponding gamma population were
tabulated. Methods of calculation are described below.

Mean and standard deviation. For each calculation of the mean
and standard deviation of gamma, 47 theta values equally spaced
between -4.6 and 4.6 were chosen. At each of these values the
standard normal density, the gamma value, and the squared gamma value
were obtained. The gamma and squared gamma values were each
numerically integrated jointly with the density using Simpson's
one-third rule of quadrature to obtain the expected value of gamma and
the expected value of gamma squared. The mean was taken as the former.
The standard deviation was obtained by using the formula for expected
values, that is, as the square root of the expected value of gamma
squared minus the squared mean. To accommodate numerical limitations
of the computer used, gamma was bounded between -5.0 and 5.0.

Absolute and root-mean-square error. Mean absolute and
root-mean-square errors were calculated in a manner similar to the
mean and standard deviation. At each of the 47 theta points, the
absolute and squared differences between theta and gamma were
calculated. The expected values of these quantities were obtained
through joint numerical integration with the normal theta density
function. The expected absolute error was the mean absolute error. The
root-mean-square error was taken as the square root of the expected
value of the squared difference between gamma and theta.

Correlation. The correlation between theta and gamma was
computed as an index of linearity of the transformation. At each of
the 47 theta values, the cross-product of theta and gamma was computed.
Since all of the joint theta-gamma density falls along the regression
function, this cross-product, jointly integrated with the normal
theta density, produces the expected cross-product. The correlation
between theta and gamma was computed from this value and the known
and previously computed means and standard deviations of the theta
and gamma distributions.
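
The tabulation steps above can be sketched in Python as follows. This
is an illustrative sketch rather than the report's program: the
function names are hypothetical, `gamma_fn` stands for any
theta-to-gamma transformation supplied by the caller, and the theta
population is taken to have mean 0 and standard deviation 1 when the
correlation is formed.

```python
import math


def simpson(values, h):
    """Composite Simpson's one-third rule for equally spaced ordinates."""
    n = len(values) - 1  # number of intervals
    if n < 2 or n % 2:
        raise ValueError("Simpson's rule needs an even number of intervals")
    return (values[0] + values[-1]
            + 4.0 * sum(values[1:-1:2])
            + 2.0 * sum(values[2:-1:2])) * h / 3.0


def normal_density(t):
    """Standard normal density."""
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)


def summarize(gamma_fn, n_points=47, lo=-4.6, hi=4.6, bound=5.0):
    """Summary indices of a theta-to-gamma transformation."""
    h = (hi - lo) / (n_points - 1)
    thetas = [lo + i * h for i in range(n_points)]
    dens = [normal_density(t) for t in thetas]
    # gamma is bounded between -bound and +bound, as in the text
    gammas = [max(-bound, min(bound, gamma_fn(t))) for t in thetas]

    def expect(vals):
        # expected value: integrate the quantity jointly with the density
        return simpson([v * d for v, d in zip(vals, dens)], h)

    mean = expect(gammas)
    sd = math.sqrt(max(expect([g * g for g in gammas]) - mean * mean, 0.0))
    mae = expect([abs(g - t) for g, t in zip(gammas, thetas)])
    rmse = math.sqrt(expect([(g - t) ** 2 for g, t in zip(gammas, thetas)]))
    cross = expect([g * t for g, t in zip(gammas, thetas)])
    # theta has mean 0 and SD 1, so the covariance equals the cross-product
    corr = cross / sd if sd > 0.0 else float("nan")
    return {"mean": mean, "sd": sd, "mae": mae, "rmse": rmse, "corr": corr}
```

For a linear transformation such as gamma = 0.5 theta + 0.3, the sketch
should return a mean near 0.3, a standard deviation near 0.5, and a
correlation near 1.0, with small departures caused by truncating the
normal density at -4.6 and 4.6.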

Efficiency of Ability Estimation 

Although the transformation function provides a measure of the
bias incurred through use of errant parameters, it tells little about
the precision with which the parameters permit an estimate of the
trait levels. An index closely related to precision of estimation
is the statistical Fisherian information. For a given test scoring
function at a specified level of a trait, theta, this information
can generally be expressed as the ratio of the squared derivative of
the expected value of the scoring function to the variance of the
scoring function at the specified level of theta:



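$$I(\theta) = \frac{\left[\dfrac{\partial}{\partial\theta}E(x\mid\theta)\right]^{2}}{\sigma^{2}_{x\mid\theta}} \tag{9}$$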



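When the score, x, is a linear combination of 0-1 item responses, the
components of the information equation can be written as:

$$\frac{\partial}{\partial\theta}E(x\mid\theta) = \frac{\partial}{\partial\theta}\sum_{g=1}^{m} w_g\,E(u_g\mid\theta) = \frac{\partial}{\partial\theta}\sum_{g=1}^{m} w_g\,P_g(\theta) = \sum_{g=1}^{m} w_g\,P'_g(\theta) \tag{10}$$

and

$$\sigma^{2}_{x\mid\theta} = \sum_{g=1}^{m} w_g^{2}\,P_g(\theta)\,Q_g(\theta) \tag{11}$$

where $w_g$ = a weight assigned to item $g$, $Q_g(\theta) = 1 - P_g(\theta)$,

and $P'_g(\theta) = (1 - c_g)\,D a_g\,\psi\left[D a_g(\theta - b_g)\right]$,
with $\psi$ the logistic density function.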

Birnbaum (1968) discussed choosing the weights to be best or
"locally best" in the sense that they would make the information of
the linear combination maximal at a given value of theta. In cases
where guessing is not possible, these weights are simply:

$$w_g = D a_g \tag{12}$$

In cases where guessing is effective, the weights change as a
function of theta and are given by Equation 4 above. Weights obtained
for a given level of theta would, when used in linear combination,
provide maximum information for making discriminations between two
theta levels arbitrarily close to the theta level of interest. When
true item parameters are used, information computed in this manner
is equal to the test information at the theta level of interest
obtained by summing the item information values at that point.

The information in any linear combination can be evaluated;
therefore, it makes sense to evaluate the information available at a
given level of theta from items with errant parameters by evaluating
the information in the linear combination obtained by using the
locally best weights obtained through the errant parameters. This is
done for a given theta level by first finding the corresponding gamma
level. Weights are then determined using this gamma level in place
of theta in Equation 4 and substituting the errant parameters for
the true ones as in Equation 13:

$$\hat{w}_g(\Gamma) = D\hat{a}_g\,\Psi\left[D\hat{a}_g(\Gamma - \hat{b}_g) - \ln(\hat{c}_g)\right] \tag{13}$$

The information can then be determined by substituting
$\hat{w}_g(\Gamma)$ for $w_g$ in Equations 10 and 11. This information
is interpretable on the same scale as the true information, and the
relative information of tests using true and errant parameters can be
obtained by taking their ratio. The reciprocal of this ratio can be
interpreted as the relative numbers of items with true and errant
parameters necessary to achieve an equivalent level of measurement
precision at the specified trait level.
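
A minimal Python sketch of this calculation is given below, under
stated assumptions: items are (a, b, c) tuples with c greater than
zero, the gamma level for the theta of interest has already been found
and is passed in by the caller, and all function names are
hypothetical rather than taken from the report.

```python
import math

D = 1.7  # scaling constant used in the logistic model


def logistic(t):
    """Logistic function Psi(t)."""
    return 1.0 / (1.0 + math.exp(-t))


def p3pl(theta, a, b, c):
    """Three-parameter logistic probability of a correct response."""
    return c + (1.0 - c) * logistic(D * a * (theta - b))


def dp3pl(theta, a, b, c):
    """Derivative of the response probability with respect to theta."""
    psi = logistic(D * a * (theta - b))
    return (1.0 - c) * D * a * psi * (1.0 - psi)


def locally_best_weight(level, a, b, c):
    """Weight in the form of Equations 4 and 13; requires c > 0."""
    return D * a * logistic(D * a * (level - b) - math.log(c))


def information(theta, weights, true_items):
    """Information in a weighted sum of 0-1 responses (Equations 9 to 11)."""
    num = sum(w * dp3pl(theta, a, b, c)
              for w, (a, b, c) in zip(weights, true_items))
    den = sum(w * w * p3pl(theta, a, b, c) * (1.0 - p3pl(theta, a, b, c))
              for w, (a, b, c) in zip(weights, true_items))
    return num * num / den


def relative_efficiency(theta, gamma, true_items, errant_items):
    """Information with errant weights divided by information with true weights."""
    w_true = [locally_best_weight(theta, a, b, c) for a, b, c in true_items]
    w_errant = [locally_best_weight(gamma, a, b, c) for a, b, c in errant_items]
    return (information(theta, w_errant, true_items)
            / information(theta, w_true, true_items))
```

The response probabilities and their derivatives are always evaluated
with the true parameters at theta, because those govern the actual
responses; only the weights change. When the errant parameters equal
the true ones and gamma equals theta, the ratio is 1.0; otherwise it
cannot exceed 1.0, since the true locally best weights maximize the
information at theta.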

Information. The information function produced by the method
described above is nearly as awkward to work with as the regression
functions described earlier. The information function data were thus
condensed in a similar manner. For each condition of interest,
information was calculated at the 47 theta points. Expected
information was then obtained by jointly integrating these information
values with the standard normal density function. The resulting value
represented the average amount of information that would be extracted
by the test for an examinee selected at random from a standard normal
population. To provide a basis for comparability, information per
item is presented throughout this report.

Relative efficiency. When comparing information extracted by
different procedures, the comparison is often done in terms of a 
ratio. The ratio of information from two tests is an index of rela- 
tive efficiency. If the ratio of Test A information to Test B infor- 
mation is .80, Test A is 80% as efficient as Test B. Test B would 
achieve an efficiency equivalent to that of Test A with only 80% as 
many items as it currently has. 

Whether an index will indicate calibration or linking error is 
dependent, in large part, on how it is applied. The indices pre- 
sented thus far have all been discussed as indicators of calibration 
error. The underlying concepts and the indices themselves may, how- 
ever, be used to evaluate linking errors by applying them to the case 
where multiple sets of items are calibrated separately and then link- 
ed together. 

The effects of calibration and linking errors are difficult to 
separate using fidelity or asymptotic ability indices. They can be
readily separated using the efficiency indices, however. Loss in 
efficiency is caused only by relative errors of calibration, not by 
constant errors. A linking error exists when the unit and origin of 
the trait resulting from the item parameters differ from the true 
unit and origin of the trait. Linking errors are constant within 
an item set; thus, they result in no loss of efficiency and are not 
usually considered a problem when all items are calibrated as a single 
set. If, however, two or more sets of items are calibrated separately 
and then combined into a single pool, errors constant within each set 
are now relative in the combined pool. The result will be a loss of 
efficiency. 

Loss of efficiency in a single item set is due to calibration
error. Loss of efficiency in a combined pool is due to both
calibration and linking errors. The index of efficiency used in this
study was information, and information is additive. If information
contained in the combined pool is subtracted from the total
information contained in the individual pools, the value remaining is
the information lost as a result of linking. The ratio of the
information available using the linked parameters to the information
available using the true parameters yields an efficiency index of the
linked items. The ratio of the information available from the linked
parameters to the information available from the estimated parameters
within sets yields an efficiency index of the linking procedure.
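
The indices just described can be written compactly. The symbols
$I_T$, $I_W$, and $I_L$ are labels introduced here for illustration
and are not notation from the report. They denote the information
available from the true parameters, from the parameters estimated
within sets, and from the linked parameters, respectively.

$$E_{\text{linked items}} = \frac{I_L}{I_T}, \qquad E_{\text{linking}} = \frac{I_L}{I_W}, \qquad I_{\text{lost}} = I_W - I_L$$

It follows algebraically that
$E_{\text{linked items}} = E_{\text{linking}} \times (I_W / I_T)$, so
the efficiency of the linked items is the product of the efficiency of
the linking procedure and the efficiency of the within-set calibration.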
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III. EVALUATION -OF THE BASIC DATA SETS 



Three basic data sets comprised the data on which most of the 
analyses reported here were based. Evaluation of these data served 
two purposes. First, they provided baseline data free of linking 
error for comparison in later phases of the study. Second, the data
provided substantial information regarding the characteristics of the 
calibration procedure used (i.e., OGIVIA). These data allowed a more 
comprehensive analysis than was available from previous research be- 
cause the evaluative criteria provided were both more extensive and 
more closely related to a test's capacity to estimate ability. 

As will be the case with all analyses presented, each data set 
will be discussed separately. Within the discussion of each set, the 
three categories of evaluative criteria presented in the previous 
section will be discussed. 



Randomly Sampled Examinees 
Fidelity of Parameter Estimation

Table 9 presents parameter bias statistics for each of the three
parameters, a, b, and c, for the randomly sampled calibration groups.
Bias, as used in this table, is the mean of the estimated parameters
minus the mean of the true parameters. Means of values obtained from
five calibrations are presented for each of the 12 cells in the
center of each section of the table, and row and column simple
averages are presented in the margins.
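
As a small illustration of the fidelity indices used in this section,
the Python sketch below computes bias as defined above, together with
the mean absolute error and root-mean-square error of a set of
estimates. The last two definitions are assumed here to be the usual
ones (mean of absolute differences and square root of the mean squared
difference between estimated and true values). The function name and
the numbers in the usage note are hypothetical.

```python
import math


def fidelity_indices(true_values, estimates):
    """Return (bias, mean absolute error, root-mean-square error)."""
    n = len(true_values)
    if n == 0 or n != len(estimates):
        raise ValueError("need two non-empty lists of equal length")
    errors = [e - t for t, e in zip(true_values, estimates)]
    # bias: mean of the estimates minus mean of the true values
    bias = sum(estimates) / n - sum(true_values) / n
    mae = sum(abs(err) for err in errors) / n
    rmse = math.sqrt(sum(err * err for err in errors) / n)
    return bias, mae, rmse
```

For example, with hypothetical true a values of 1.0, 1.2, and 0.8 and
estimates of 1.5, 1.0, and 1.1, the bias is 0.2 while the mean absolute
error is about 0.33, which shows how errors of opposite sign cancel in
the bias but not in the absolute error.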

As can be seen from the first section of the table, the a param- 
eters exhibited substantial bias at short test lengths. At a length 
of 20 items, the estimates were high by approximately .6 units. This 
bias proceeded smoothly to zero by a test length of 65 items. No
consistent change was observed in the amount of bias as the number of 
examinees in the calibration group increased from 500 to 2,000. 

The b parameters exhibited relatively little bias in any of the 
12 cells. The highest was .155 in the 20-item tests calibrated on 
500 examinees. As shown by the marginal averages, bias decreased 
slightly with increasing test length and sample size. The decrease 
was very slight, however, and as can be observed from the individual 
cell entries, was by no means consistent. It may be observed that
the errors for the b parameters were smaller than those for the a 
parameters. These comparisons are not readily interpretable, however, 
because the a and b parameters are on different scales. 

Bias in the c parameters was also quite small. No obvious trend
with respect to group size was observed, but bias did appear to
decrease with increasing test length. Although not as consistent as
with the a parameters, this decrease was fairly consistent with
increasing test length.

Table 9. Item Parameter Bias
Basic Data Set — Randomly Sampled Examinees

            Sample               Test Length
Parameter   Size        20      35      50      65    Average

a            500      .594    .292    .095   -.029     .238
            1000      .623    .232    .094    .009     .239
            2000      .531    .243    .079    .017     .231
         Average      .599    .257    .089   -.001

b            500      .155    .121    .098    .102     .119
            1000      .114    .123    .129    .099     .117
            2000      .154    .089    .066    .071     .095
         Average      .141    .111    .098    .091

c            500      .017    .024    .001    .006     .012
            1000      .014    .023    .011   -.003     .012
            2000      .033    .011   -.004   -.001     .010
         Average      .021    .020    .003    .001

Table 10 presents correlations between true and estimated item 
parameters for the randomly selected calibration groups. Each cell 
entry represents Fisher's r-to-z average of correlations obtained in- 
dependently in each of five calibrations. The marginal values are, 
likewise, r-to-z averages of the cell averages.
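
The r-to-z averaging can be sketched in Python as follows. The
function name is hypothetical; the procedure is the standard Fisher
transformation, in which each correlation is converted to z with the
inverse hyperbolic tangent, the z values are averaged, and the mean is
converted back to a correlation.

```python
import math


def fisher_average(correlations):
    """Average correlations through Fisher's r-to-z transformation."""
    if not correlations:
        raise ValueError("need at least one correlation")
    zs = [math.atanh(r) for r in correlations]  # r to z
    return math.tanh(sum(zs) / len(zs))         # mean z back to r
```

Applied to the four a-parameter cell entries for the 1000-examinee
groups in Table 10 (.645, .612, .673, and .560), the function gives
approximately .624, which agrees with the marginal value printed for
that row.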

The a-parameter correlations ranged from .435 to .684. Slight
increases in correlations between true and estimated a parameters with
increasing test length and calibration group size are apparent in the
first section of Table 10. The increases were not markedly consistent,
however, as may be observed both in the marginal and the cell entries.

Similar observations can be made regarding trends in the b-parameter
correlations. Slight but consistent increases were observed in
the marginal values. The individual rows and columns did not all
exhibit the same consistency, however. Although the increases were
slight (from .985 to .990), it should be noted that slight increases
are important in correlations as near 1.0 as these. The correlational
data presented here suggest that the b parameters are extremely
well estimated at all combinations of test length and calibration
group size considered.

Relatively consistent improvements in the c-parameter correlations
were observed as test length increased up to a length of 50
items. At a length of 65 items, two of the three correlations dropped
slightly. Improvement with increasing sample size increased to a size
of 1000 examinees. Increasing the sample size to 2000 resulted in no
improvements. Overall, the c-parameter correlations were slightly
lower than those of the a parameters; differences of approximately
[value illegible in the source] were observed.

Table 10. Parameter Correlations
Basic Data Set — Randomly Sampled Examinees

            Sample               Test Length
Parameter   Size        20      35      50      65    Average

a            500      .435    .505    .632    .647     .561
            1000      .645    .612    .673    .560     .624
            2000      .460    .643    .684    .559     .618
         Average      .520    .590    .664    .624

b            500      .973    .984    .986    .988     .984
            1000      .989    .987    .989    .992     .989
            2000      .986    .992    .992    .990     .991
         Average      .985    .988    .989    .990

c           [entries illegible in the source]

Table 11 presents average absolute errors for each parameter.
Cell values are simple averages of the five calibrations contained
in each. The marginal values are simple averages of the cell
values. Relatively consistent decreases in the amounts of a-parameter
error were apparent with increasing test length and calibration group
size. [The final sentence of this paragraph, which relates these
decreases to the bias and correlation results, is illegible in the
source.]





Table 11. Absolute Parameter Error
Basic Data Set — Randomly Sampled Examinees

            Sample               Test Length
Parameter   Size        20      35      50      65    Average

a            500      .839    .642    .491    .455     .607
            1000      .775    .531    .450    .472     .557
            2000      .841    .499    .404    .419     .541
         Average      .818    .557    .448       ?

b            500      .314    .298    .285    .262     .290
            1000      .239    .271    .275    .247     .258
            2000      .316    .196    .209    .233     .238
         Average      .290    .255    .256    .247

c            500      .136    .128    .108    .110     .120
            1000      .128    .111    .095    .085     .105
            2000      .146    .098    .092    .096     .108
         Average      .137    .112    .098    .097

Note. ? = entry missing in the source.





Intuitively, these errors appear quite large because an a value of .8 
is considered adequate for adaptive testing, and an average error this 
large was observed in the first column. 

The second section of Table 11 shows slight and inconsistent de- 
creases in absolute error of the b parameters with increasing test 
length and calibration group size. The decreases were somewhat more 
consistent with increasing calibration group size; with the exception 
of the 20-item test length, absolute errors decreased with increased 
sample size. 

Errors in the c parameters generally decreased with increasing
test length and group size. This trend appeared to be somewhat more
consistent relative to group size than to test length. Noting that
an average c parameter is approximately .2, the errors observed in
Table 11 typically exceeded half this amount and seemed quite large.

Table 12 presents root-mean-square errors of estimate for the
item parameters. Root-mean-square error can be interpreted in a
manner similar to absolute error. The marginal averages in Table
12 were computed as the square root of the mean of the squares in
the corresponding rows and columns. Essentially the same
observations made regarding the absolute error can be made here
regarding the root-mean-square errors.

Table 12. Root-Mean-Square Parameter Error
Basic Data Set — Randomly Sampled Examinees

            Sample               Test Length
Parameter   Size        20      35      50      65    Average

a            500      .710    .522    .368    .359     .510
            1000      .680    .430    .341       ?        ?
            2000      .735    .422    .305    .295     .474
         Average      .709    .460    .339    .333

b            500      .242    .239    .212    .195     .223
            1000         ?    .203    .202    .185     .196
            2000      .261    .155    .156    .163     .189
         Average      .234    .202    .191    .182

c            500      .108    .101    .083    .080     .094
            1000      .103    .088    .074    .066     .084
            2000      .122    .074    .067    .071     .087
         Average      .112    .089    .075    .072

Note. ? = entry illegible in the source.
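
The marginal averaging rule for root-mean-square errors (the square
root of the mean of the squared cell entries) can be sketched in
Python as follows. The function name is hypothetical.

```python
import math


def rms_average(values):
    """Square root of the mean of the squared entries."""
    if not values:
        raise ValueError("need at least one value")
    return math.sqrt(sum(v * v for v in values) / len(values))
```

Applied to the four a-parameter cell entries for the 500-examinee
groups in Table 12 (.710, .522, .368, and .359), the function gives
approximately .510, which agrees with the marginal value printed for
that row. A simple arithmetic mean of the same entries would be about
.490, so the two averaging rules are not interchangeable.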

Characteristics of Asymptotic Ability Estimates

Table 13 presents the average absolute error of estimate of 
ability that would be obtained if the calibrated items were admin- 
istered an infinite number of times to an infinitely large standard 
normal population of examinees and were scored using the estimated 
parameters. Entries corresponding to the 12 cells are simple aver- 
ages of this error obtained with five different sets of items. These 
errors are unlike the absolute errors discussed in the previous sec- 
tion in that they refer to asymptotic errors in the estimation of 
ability and not to errors in the item parameters themselves. 

The absolute errors, presented in Table 13, consistently
decreased as the test lengths increased and, except for one
inconsistent cell, as calibration group size increased. The unit of
these errors is the same as the standard theta metric and some
comparison can be made with absolute errors in the b parameters
presented in Table 11. The errors in the asymptotic ability estimates
were somewhat smaller than those observed with the b parameters. This
is probably due to an averaging effect across items. An important
feature, however, is that these errors did not reach zero as
test length reached infinity.

Table 13. Absolute Asymptotic Ability Error
Basic Data Set — Randomly Sampled Examinees

Sample               Test Length
Size        20      35      50      65    Average

 500      .170    .140       ?    .107     .130
1000         ?    .102    .101    .093     .105
2000         ?    .093    .085    .086     .105
Average   .150    .112    .09?    .095

Note. ? = entry or digit missing or illegible in the source.

Root-mean-square errors of asymptotic ability estimates are
presented in Table 14. Marginal values in the table were computed as
the square root of the mean of the squared entries in the
corresponding rows and columns. [The remainder of this paragraph,
which relates these results to the previous table, and Table 14
itself are illegible in the source.]

Table 15 shows relative efficiencies.

Table 15. Relative Efficiency
Basic Data Set — Randomly Sampled Examinees

Sample               Test Length
Size        20      35      50      65    Average

 500      .843    .866    .899    .92?     .884
1000      .863    .89?    .916    .943     .90?
2000      .818    .91?    .94?    .952     .906
Average      ?    .890    .913    .941

Note. ? = entry or digit missing or illegible in the source.

Relative efficiencies were obtained by summing the information
obtained using the estimated parameters and using the true
parameters, and then dividing the sum obtained from the estimated
parameters by the sum obtained from the true parameters. The
marginal efficiencies were computed as the simple average of the
corresponding row or column efficiencies. Average item information
implicitly weighted the constituents of the row averages by
[remainder of sentence illegible in the source].

[Several sentences are illegible in the source here. The legible
fragments indicate that the efficiency values can be interpreted in
an absolute sense and that, with the exception of the lower left
cell, efficiencies increased with increasing test length and
calibration group size.]

An increase in test length produced a relatively larger increase in
efficiency than did an increase in calibration group size. More than
tripling the test length from 20 to 65 items produced a change in
efficiency of [value illegible in the source]. Quadrupling the
calibration group size from 500 to 2000 examinees resulted in an
increase of only [value illegible in the source], less than
one-fourth the increase observed with test length. The data from the
randomly selected examinees thus suggest that test length is
relatively more important than calibration group size in determining
the efficiency of calibration at the test lengths and sample sizes
evaluated here.
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Systematically Sampled Examinees 



Fidelity of Parameter Estimation 

Table 16 presents the parameter bias statistics for item
parameters calibrated on the systematically sampled examinees. The
first section presents bias of the a parameters. As was observed with
the randomly sampled examinees, the bias dropped as test length
increased and exhibited no definite trend with calibration group size.
All marginal bias values were about .10 units less than those observed
with the randomly sampled examinees. This trend continued even as the
bias values dropped below zero and became negative.



Table 16. Item Parameter Bias
Basic Data Set — Systematically Sampled Examinees

            Sample               Test Length
Parameter   Size        20      35      50      65    Average

a            500      .504    .074    .008   -.105     .120
            1000      .478    .184    .021   -.111     .143
            2000      .462    .223    .017   -.084     .155
         Average      .481    .160    .015   -.100

b            500      .090    .298    .207    .151     .187
            1000      .186    .214    .045    .141     .147
            2000      .045    .073   -.067    .175     .057
         Average      .107    .195    .062    .156

c            500      .042   -.001    .013   -.024     .007
            1000      .029    .007   -.013   -.021     .001
            2000      .026    .009   -.022   -.009     .001
         Average      .032    .005   -.007   -.018





Bias in the b parameters exhibited no obvious trend with
increasing test length. This is different from the random-sampling
case, which exhibited a slight decrease. The same slight decrease with
respect to calibration group size was again observed, however. The
range in bias of the b parameters was somewhat larger in these
samples. Where the range was from .066 to .155 in the random samples,
the range was from -.067 to .298 in these samples.

Bias values of the c parameters also had a wider range in these
samples. Where the random samples had bias values ranging from -.004
to .033, these samples had values ranging from -.022 to .042. The
slight trend toward less bias observed in the random samples had an
analog in the systematic samples; the trend could better be described
as a trend toward more negative bias, however. Again, no consistent
trend was observed with respect to calibration group size.

Table 17 presents the average correlations between true and 
estimated parameters for the systematically sampled calibration 
groups. As with the randomly sampled groups, a slight but inconsis- 
tent increasing trend of the a-parameter correlations with respect 
to test length was observed. No trend with respect to calibration 
group size was obvious, however. The overall magnitude of the a- 
parameter correlations in the systematically sampled groups was 
slightly lower than those observed in the randomly sampled groups. 



Table 17. Parameter Correlations
Basic Data Set — Systematically Sampled Examinees

            Sample               Test Length
Parameter   Size        20      35      50      65    Average

a            500      .560    .582    .562    .463     .543
            1000      .204    .609    .582    .579     .508
            2000      .355    .601    .709    .664     .596
         Average      .383    .597    .622    .574

b            500      .972    .976    .987    .979     .979
            1000      .984    .987    .986    .985     .986
            2000      .982    .985    .990    .989     .987
         Average      .980    .983    .988    .985

c            500      .437    .360    .396    .381     .394
            1000      .448    .438    .416    .396     .425
            2000      .372    .375    .421    .519     .424
         Average      .420    .391    .411    .434





The b-parameter correlations exhibited slight increasing trends
with respect to test length and calibration group size. As was
observed in the randomly sampled groups, these trends were inconsistent.
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The magnitudes of the correlations were slightly lower in the system- 
atically sampled groups. 

No trends were apparent in the c-parameter correlations. Unlike
those of the random samples, no notable increase was observed at a
test length of 35 or a sample size of 1000. The magnitudes of the
c-parameter correlations were somewhat lower here than those observed
in the random samples.

Average absolute errors of the item parameters for the
systematically sampled groups are presented in Table 18. A decreasing
trend in a-parameter errors with respect to test length was apparent
but was not particularly consistent. No trend was obvious in the
a-parameter errors with respect to calibration group size. The
magnitudes of the errors observed here were about the same as those
observed in the randomly sampled groups.



Table 18. Absolute Parameter Error
Basic Data Set — Systematically Sampled Examinees

[Table entries illegible in the source.]

[The opening of the paragraph on b-parameter errors is illegible in
the source. The legible text resumes:]




the randomly sampled groups where no trend was observed with respect 
to test length but a slight trend was observed with respect to group 
size. The magnitudes of the errors were greater here than in the 
randomly sampled groups. 

The c-parameter errors showed a relatively consistent decreasing 
trend with respect to test length but no consistent trend with re- 
spect to sample size. These findings are similar to those of the 
randomly sampled groups except that a slight trend with respect to 
group size was observed there. Magnitudes of the errors were slight- 
ly higher in the systematically sampled groups. 

Table 19 presents the root-mean-square errors of estimate for
the three parameters. As was the case in analysis of the randomly
sampled groups, essentially the same observations made regarding the
absolute error can be made regarding the root-mean-square error.

Table 19. Root-Mean-Square Parameter Error
Basic Data Set — Systematically Sampled Examinees

[Table entries illegible in the source.]

Characteristics of Asymptotic Ability Estimates

Table 20 presents the absolute errors of asymptotic ability
estimates for items calibrated using systematically sampled groups.
Unlike the corresponding table for the randomly sampled groups, no
consistent trends with respect to test length or sample size were
observed. The magnitudes of the errors were consistently larger,
however. Absolute errors in the randomly sampled groups ranged from
.085 to .170; in the systematically sampled groups they ranged from
.124 to .346.



Table 20. Absolute Asymptotic Ability Error
Basic Data Set — Systematically Sampled Examinees

Sample               Test Length
Size        20      35      50      65    Average

 500      .320    .336    .227    .266     .287
1000      .346    .313    .124    .215     .249
2000      .225    .263    .137    .293     .229
Average   .297    .304    .163    .258





Similar observations can be made for the root-mean-square errors
presented in Table 21. No definite trends were apparent and the
magnitude of the errors was larger than in the randomly sampled groups.
Root-mean-square errors ranged from .102 to .229 in the randomly
sampled groups; in the systematically sampled groups they ranged from
.158 to .466.



Table 21. Root-Mean-Square Asymptotic Ability Error
Basic Data Set — Systematically Sampled Examinees

Sample               Test Length
Size        20      35      50      65    Average

 500      .366    .434    .303    .330     .362
1000      .466    .349    .158    .249     .327
2000      .288    .305    .179    .346     .286
Average   .381    .367    .223    .311
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Efficiency of Ability Estimation 



Table 22 presents the efficiencies of the items calibrated in
the systematically sampled groups. The general trends observed in
the randomly sampled groups were again observed here. In these groups,
tripling the test length increased the calibration efficiency by 9.8%,
and quadrupling the calibration sample size only increased the
efficiency by 3.2%. Although the differences were not as pronounced,
these results corroborated the earlier ones, suggesting that test
length is more important than group size in improving calibration
efficiency.



Table 22. Relative Efficiency
Basic Data Set — Systematically Sampled Examinees

Sample                 Test Length
Size          20      35      50      65    Average

 500        .851    .851    .904    .901     .877
1000        .797    .877    .910    .930     .879
2000        .870    .884    .930    .934     .905

Average     .839    .871    .915    .922





The magnitudes of the efficiencies were approximately equal in
the two conditions. Efficiencies of the randomly sampled groups
ranged from .818 to .952. Efficiencies of the systematically sampled
groups ranged from .797 to .934. It is difficult to say whether the
slight superiority of the randomly sampled groups was due to more
appropriate ability distributions, all being standard normal, or
simply to sampling error.



Selected Examinees 




Fidelity of Parameter Estimation

Table 23 presents bias statistics for the parameters of items
calibrated on selected samples of examinees. All samples contained
1,000 examinees, so only four cells and their row average are
presented in the table. Bias in the a parameters ranged from -.283 to
.416. A consistent decreasing trend with increasing test length was
obvious. The bias progressed to a value more negative than observed in
either of the calibration groups discussed above.
Table 23. Item Parameter Bias
Basic Data Set — Selected Examinees

                       Test Length
Parameter     20      35      50      65    Average

a           .416   -.031   -.164   -.283    -.015
b          -.213   -.459   -.377   -.464    -.373
c           .145    .123    .095    .075     .111

The b parameters had a consistent negative bias. This was
undoubtedly due to the fact that the selected population had higher
ability than the standard normal (0,1) population assumed by the
calibration procedure. No trend with respect to test length was
observed.

Bias in the c parameters consistently decreased with increasing
test length. The bias was considerably higher than that observed in
corresponding tables for the other samples. Average biases for the
random and systematic samples of 1,000 examinees [values illegible in
source] were much lower than the .111 observed here.

Table 24 presents correlations between the true and estimated
parameters for the selected-examinee samples. No consistent trend
was observed in the a-parameter correlations with respect to test
length, although the correlations generally rose with [text illegible
in source]. They were somewhat lower than those observed [text
illegible in source].
The b-parameter correlations exhibited no trend with respect to
test length. Their average value of .976 was slightly lower than
those of .989 and .985 observed for the randomly and systematically
selected groups, respectively.

No trend was apparent in the c-parameter correlations, either.
Their average of .3[?]2 was lower than the values of .51[?] and
[illegible] observed in the two previous calibration groups. This
should be expected, however, because the selected group (in which only
the most able two-thirds of the examinees were selected) provided few
of the low-ability examinees needed to accurately estimate the c
parameters.

Table 25 presents average absolute errors of the item parameters.
The a-parameter errors generally decreased as test length increased.
The magnitude of the row average was slightly higher than the
corresponding row averages for the random or systematic [remainder of
passage largely illegible in source; the surviving fragments discuss
the b-parameter errors with respect to test length, their row average
relative to the random and systematic samples, and c-parameter errors
that were higher than those of the earlier samples].
Table 26. Root-Mean-Square Parameter Error
Basic Data Set — Selected Examinees

                       Test Length
Parameter     20      35      50      65    Average

a           .658    .510    .403    .411     .506
b           .459    .573    .500    .568     .529
c           .181    .158    .125    .111     .146



Characteristics of Asymptotic Ability Estimates 

Table 27 presents absolute and root-mean-square asymptotic abil- 
ity-estimation errors. Absolute errors showed no trend with respect 
to test length. The average of the row, .580, was considerably larger 
than the averages of .105 and .249 observed in corresponding earlier 
tables. 



Table 27. Asymptotic Ability Error
Basic Data Set — Selected Examinees

                             Test Length
Error                 20      35      50      65    Average

Absolute            .499    .633    .558    .630     .580
Root-Mean-Square    .591    .744    .642    .75[?]   .686



The root-mean-square errors showed an identical lack of trend 
with respect to test length. Similarly, the row average of .686 
was considerably larger than the row averages of .144 and .327 ob- 
served earlier. 

Efficiency of Ability Estimation 

Calibration efficiencies obtained in the selected samples of ex- 
aminees are presented in Table 28. The usual trend with respect to 
test length, observed with other statistics, was again observed. The 
average efficiency, .823, was somewhat lower than the corresponding 



Table 28. Relative Efficiency
Basic Data Set — Selected Examinees

              Test Length
  20      35      50      65    Average

.719    .818    .865    .889     .823



efficiencies of .90[?] and .879 observed earlier. This lowered
efficiency cannot be attributed to any particular item parameter
because all three were less precisely estimated in this calibration
sample than in the two discussed previously. It was probably due to
the combined effects of poorly estimated c parameters, caused by a
paucity of low-ability examinees, and fewer appropriate items for
ability estimation at the higher ability levels encountered. This
latter effect is due to limitations of the item pool used, but these
limitations were imposed to reflect reality, and thus the same effect
in live-examinee item calibrations would be expected.



Conclusions 



Three general conclusions and an observation can be made from the
data presented in this section. First, the parameter correlation data
were, in general, supportive of other studies investigating the
calibration effectiveness of OGIVIA. The b parameters were very well
estimated and the a and c parameters were less well estimated. The a
parameters were estimated somewhat better than the c parameters, but
the difference was not overwhelming.

The second conclusion is that test length is relatively more im- 
portant to calibration effectiveness than is sample size, at least at 
the test lengths and sample sizes investigated here. This conclusion 
is mildly supported by the fidelity of estimation data but its strong- 
est support comes from the efficiency analyses. The efficiency anal- 
yses suggested that increases in test length are at least three to 
four times as effective in improving calibration efficiency as
proportionate increases in calibration sample sizes. Given that total
testing time required to calibrate a set of items is proportional to
the number of items multiplied by the number of examinees, this finding
suggests that, if sufficient items exist, larger numbers of items 
should be calibrated on smaller samples if available total testing 
time is short. 

The third conclusion is that there appears to be little difference
in calibration efficiency as a function of random versus systematic
sampling of examinees but a large difference between these and
selected samples of examinees (as defined here). Although some
differences were observed between random and systematic samples in the
fidelity analysis, differences in the efficiencies were trivial and
probably due to sampling error. Efficiencies observed in the selected
samples were noticeably lower, however, and were probably due to a
lack of low-ability examinees for c parameter estimation and to a
distribution of abilities slightly less estimable with available items.

In addition to these conclusions, the parameter bias statistics
presented in Tables 16 and 23 suggest that OGIVIA tends to
overestimate a parameters for short test lengths. Since the test
lengths used to evaluate the real ASVAB data ranged from 20 to
[illegible] items, and OGIVIA was one of the estimation methods used,
the average a used to generate items for the simulations may have been
[remainder of paragraph illegible in source; the surviving fragments
indicate that the amount by which a parameters are overestimated
depends on the method by which they are estimated].
IV. LINKING WHEN EXAMINEES ARE RANDOMLY SAMPLED



Linking sets of items administered to randomly sampled examinees 
presented the simplest linking environment investigated in this re- 
search. In this situation, the equivalent-groups, anchor-group, and 
anchor-test methods were all reasonable choices. Given the added 
assumption that items were randomly assigned to forms, usually an easy 
assumption to satisfy, the equivalent-tests method was also an
acceptable method.

The basic data set containing randomly sampled examinees was 
used for this portion of the research. Although all four linking 
paradigms were conceptually reasonable to apply, only the equivalent- 
groups and equivalent-tests methods were evaluated. The anchor-group 
and anchor-test linking methods were not evaluated using this data 
set where examinees were randomly sampled from a single population. 
This deletion was done purely for efficiency of analysis. Since these 
methods do not assume randomly sampled examinees, it was reasonable
to expect that data from the systematic examinee samples would yield 
sufficient data for comparison. Given the reasonableness of this 
expectation and the extensive amount of computer time required to 
analyze those methods, a decision was made not to perform this essen- 
tially duplicate analysis. 



The equivalent-groups and equivalent-tests methods are essentially
the same in terms of the data required. The differences between them
stem from the different assumptions invoked in obtaining the
transformation parameters. The two methods have thus, for purposes of
this report, been combined into one section. Although they are
discussed as separate methods, they share common tables.

Procedure

Equivalent groups. Conceptually, equivalent-groups linking is
accomplished by finding transformation constants which, when applied
to the a and b parameters, will make the mean and variance of ability
in each group equivalent. Two transformation constants are required
to accomplish this. Given that the constants are to be applied in
the form:

    a = dk                                                   [14]

and

    b = (e - m)/k                                            [15]
where a and b are the parameters on the "equivalent" metric and d and
e are the parameters on the unlinked metric, one set of constants
that will result in a common metric with a mean of zero and variance
of one is:

    k = \sigma_\gamma                                        [16]

and

    m = \mu_\gamma                                           [17]

where \mu_\gamma and \sigma_\gamma are, respectively, the mean and
standard deviation of ability estimates in the unlinked groups. These
values may be readily verified by noting that a satisfactory
transformation must satisfy the equation:

    a(\theta - b) = d(\gamma - e)                            [18]

If a and b given in Equations 14 and 15 are substituted into Equation
18, gamma can be expressed as a function of k, m, and theta:

    \gamma = k\theta + m                                     [19]

Given that theta is to be distributed with mean zero and variance 
one, the constants k and m are obviously the standard deviation and 
the mean of gamma. Thus, the constants in the equivalent-groups meth- 
od are simply the mean and standard deviation of the abilities in the 
unlinked groups. 
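
As an illustration of Equations 14 through 19, the following Python
sketch computes the two constants from a set of ability estimates and
applies them to item parameters. The function names and the numeric
values are hypothetical and are not taken from the report, and the
population form of the standard deviation is an assumption, since the
report does not say which form was used.

```python
from statistics import mean, pstdev


def equivalent_groups_constants(ability_estimates):
    """Return (k, m): the SD and mean of ability in one unlinked group."""
    m = mean(ability_estimates)
    k = pstdev(ability_estimates)  # assumption: population SD
    return k, m


def link_items(items, k, m):
    """Apply a = d*k and b = (e - m)/k to (d, e) pairs on the unlinked metric."""
    return [(d * k, (e - m) / k) for d, e in items]


# Hypothetical illustration values (not from the report).
gammas = [0.5, 1.5, 2.5, 3.5]       # abilities on the unlinked metric
items = [(1.0, 2.0), (0.8, 3.0)]    # (d, e) pairs on the unlinked metric

k, m = equivalent_groups_constants(gammas)
linked = link_items(items, k, m)

# After linking, abilities theta = (gamma - m)/k have mean 0 and SD 1,
# and a*(theta - b) equals d*(gamma - e) for every item.
thetas = [(g - m) / k for g in gammas]
```

The design point is that the item response function depends on ability
and difficulty only through a(theta - b), so rescaling ability by k and
m leaves every response probability unchanged.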

In practice, true abilities are not available, however, and they 
must be estimated. If errors of measurement are equivalent in each 
group or adequately compensated for, equivalent-groups linking may be 
accomplished using ability estimates. There are, however, several such 
estimates that may be used. Four methods of estimating ability were 
investigated including two Bayesian and two maximum-likelihood methods. 
In addition to simple means and standard deviations of these estimates, 
robust estimation procedures were applied to the maximum-likelihood 
estimates. This resulted in six methods for determining the equiva- 
lent-groups transformation constants. 

The program OGIVIA uses a modal Bayesian estimate with a
standard-normal prior ability assumption. The estimates provided by
OGIVIA were based on an early stage of the program which did not use
the final item parameter estimates. Proceeding in the spirit of OGIVIA
but using better parameter estimates, modal Bayesian ability estimates
assuming a standard-normal prior were obtained by solving the
following equation for theta:



    \theta = 1.7 \sum_g a_g \exp(x_g) \left[ \frac{u_g}{c_g + \exp(x_g)} - (1.0 + \exp(x_g))^{-1} \right]      [20]

where u_g = 1 if the item is answered correctly
          = 0 otherwise

and x_g = 1.7 a_g (\theta - b_g)
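
A minimal sketch of how the modal Bayesian estimate might be computed
follows. It assumes the three-parameter logistic model with
x_g = 1.7 a_g (theta - b_g) and a standard-normal prior, so that the
posterior mode is a root of the likelihood score minus theta. The
report does not state which root-finding method was used; bisection is
an assumption here, and the function names and item values are
hypothetical.

```python
import math

D = 1.7  # scaling constant in x_g = 1.7 * a_g * (theta - b_g)


def likelihood_score(theta, items, responses):
    """1.7 * sum_g a_g exp(x_g) [u_g/(c_g + exp(x_g)) - 1/(1 + exp(x_g))]."""
    total = 0.0
    for (a, b, c), u in zip(items, responses):
        # Clamp x so exp() stays finite; the terms saturate long before +/-50.
        x = max(-50.0, min(50.0, D * a * (theta - b)))
        ex = math.exp(x)
        total += a * ex * (u / (c + ex) - 1.0 / (1.0 + ex))
    return D * total


def modal_bayes_estimate(items, responses, tol=1e-8):
    """Find a root of likelihood_score(theta) - theta by bisection."""
    # |likelihood_score| <= 1.7 * sum(a), so a root lies inside +/- bound.
    bound = D * sum(a for a, _, _ in items) + 1.0
    lo, hi = -bound, bound
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if likelihood_score(mid, items, responses) - mid > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Hypothetical (a, b, c) items and one response pattern.
items = [(1.0, 0.0, 0.2), (1.5, 0.5, 0.2), (0.8, -0.5, 0.2)]
theta_hat = modal_bayes_estimate(items, [1, 0, 1])
```

Because the prior term grows without bound while the likelihood score
is bounded, the estimate is always finite, even for all-correct or
all-incorrect response patterns. If the posterior has more than one
stationary point, bisection returns only one of them.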



The Bayesian estimation procedure assuming a normal prior im- 
plicitly regresses the estimates at finite test lengths. The prac- 
tical effect of this on linking is to bias the linking constants. 
The second estimation procedure incorporated an attempt to correct 
for this regression by progressing the estimation by an amount equiv- 
alent to the suspected regression. This adjustment was accomplished 
by using the Bayesian posterior variance estimate obtained from
Equation 21 and the Bayesian ability estimate obtained from Equation 20
as prescribed in Equation 22:

    \sigma_B^2 = -\left[ -1 + 2.89 \sum_g a_g^2 \exp(x_g) \left( \frac{u_g c_g}{(c_g + \exp(x_g))^2} - (1.0 + \exp(x_g))^{-2} \right) \right]^{-1}      [21]

    \hat{\theta}_{Pro} = \hat{\theta}_B (1 - \sigma_B^2)^{-1/2}      [22]



Another procedure to ameliorate the Bayesian regression is to 
use a maximum-likelihood estimation procedure instead of a Bayesian 
one. The maximum-likelihood procedure attempts to be unbiased and 
does not regress the ability estimates. It has problems, however, 
in that it tends to make some extreme estimates when the test length 
is finite. Individuals answering all items correctly or less than a
chance number correctly receive infinite ability estimates. Such
estimates, in turn, cause some difficulty in calculation of means and
variances of the ability estimates. Maximum-likelihood estimation was
used as the third estimation procedure. In most cases, these esti- 
mates were obtained by finding the root in theta of Equation 23: 
    \sum_g a_g \exp(x_g) \left[ \frac{u_g}{c_g + \exp(x_g)} - (1.0 + \exp(x_g))^{-1} \right] = 0      [23]



In cases where the estimates were beyond plus or minus 3.5, the es- 
timates were artificially bounded at those values. 
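
The bounding rule can be sketched as follows. This is an illustrative
Python example, not the report's program: it assumes the
three-parameter logistic likelihood equation, uses bisection (the
report does not name a root-finding method), and assumes the likelihood
equation has a single root inside the bounds. All names and item values
are hypothetical.

```python
import math


def ml_score(theta, items, responses):
    """Left side of the likelihood equation for 3PL items (a, b, c)."""
    total = 0.0
    for (a, b, c), u in zip(items, responses):
        ex = math.exp(1.7 * a * (theta - b))
        total += a * ex * (u / (c + ex) - 1.0 / (1.0 + ex))
    return total


def bounded_ml_estimate(items, responses, bound=3.5, tol=1e-8):
    """Maximum-likelihood estimate, held at +/- bound when it lies outside."""
    if ml_score(bound, items, responses) > 0.0:
        return bound       # likelihood still rising at the upper bound
    if ml_score(-bound, items, responses) < 0.0:
        return -bound      # likelihood still falling at the lower bound
    lo, hi = -bound, bound
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if ml_score(mid, items, responses) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Hypothetical (a, b, c) items.
items = [(1.0, 0.0, 0.2), (1.5, 0.5, 0.2), (0.8, -0.5, 0.2)]
all_correct = bounded_ml_estimate(items, [1, 1, 1])
all_wrong = bounded_ml_estimate(items, [0, 0, 0])
mixed = bounded_ml_estimate(items, [1, 0, 1])
```

An all-correct pattern has a score that is positive everywhere, so its
unbounded estimate would be infinite; the rule above places it at 3.5,
and an all-incorrect pattern is placed at -3.5.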



The Bayesian procedure was corrected for regression. An attempt 
was made to correct the maximum-likelihood procedure for erring toward 
the extreme. This was accomplished by applying the squared standard 
error of estimate obtained from Equation 24 to the ability estimate 
obtained from Equation 23 by the method prescribed in Equation 25:

    \sigma_{ML}^2 = -\left[ 2.89 \sum_g a_g^2 \exp(x_g) \left( \frac{u_g c_g}{(c_g + \exp(x_g))^2} - (1.0 + \exp(x_g))^{-2} \right) \right]^{-1}      [24]

    [equation illegible in source]                           [25]



Truncation of the ability estimates at plus and minus 3.5 was 
one method of dealing with extreme ability estimates produced by the 
maximum-likelihood procedure. This method was somewhat arbitrary and 
still used a least-squares weighting scheme within the range. Gen- 
eral procedures of robust estimation were available to deal with 
problems such as these. One of the most popular procedures was the 
AMT sine-transformation procedure (Andrews, Bickel, Hampel, Huber,
Rogers, & Tukey, 1972; Wainer & Wright, 1980). In this procedure,
the equation



    \sum_i f[(\hat{\theta}_i - T)/S] = 0                     [26]

is solved for T and S, where T is the robust estimate of location, S
is the median absolute deviation from T divided by the constant 1.349,
and

    f[x] = \sin(x/2.1)   if -6.597 < x < 6.597               [27]

and f[x] = 0 otherwise.

The procedure was iterated, adjusting both T and S on each iteration,
until T stabilized within 0.001.
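
The iteration can be sketched in Python as follows. The report gives
the estimating equation, the scale constant, and the stopping rule but
not the update step for T; the reweighted-mean update used here, the
median starting value, and all names and data values are assumptions
made for illustration.

```python
import math
from statistics import median

C = 2.1
LIMIT = C * math.pi  # about 6.597, where sin(x/2.1) returns to zero


def psi(x):
    """f[x] = sin(x/2.1) inside (-6.597, 6.597), and 0 otherwise."""
    return math.sin(x / C) if -LIMIT < x < LIMIT else 0.0


def robust_location_scale(values, scale_divisor=1.349, tol=0.001, max_iter=100):
    """Iterate T and S until T changes by less than tol; return (T, S)."""
    t = median(values)  # assumption: start from the median
    s = 0.0
    for _ in range(max_iter):
        s = median([abs(v - t) for v in values]) / scale_divisor
        if s == 0.0:
            break
        # Weights psi(u)/u turn the estimating equation into a weighted mean.
        weights = []
        for v in values:
            u = (v - t) / s
            weights.append(psi(u) / u if abs(u) > 1e-12 else 1.0 / C)
        total = sum(weights)
        if total == 0.0:
            break
        new_t = sum(w * v for w, v in zip(weights, values)) / total
        done = abs(new_t - t) < tol
        t = new_t
        if done:
            break
    return t, s


# Hypothetical ability estimates with one extreme value.
t, s = robust_location_scale([-1.0, -0.5, 0.0, 0.5, 1.0, 50.0])
```

In this example the extreme value of 50 falls outside the range where
f is nonzero, so it receives zero weight and T settles near the center
of the remaining five values rather than near the ordinary mean.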

This robust estimation procedure was applied to the
maximum-likelihood estimates and the regressed-maximum-likelihood
estimates obtained above to produce the fifth and sixth methods of
estimating the mean and standard deviation of ability. Unlike the
first four methods, the robust techniques were not methods of
estimating ability but rather methods of obtaining means and standard
deviations of estimates. The means and standard deviations were the
only elements used for linking, however, and these robust procedures
thus produced two more methods of equivalent-groups linking. It should
be noted that the robust techniques were applied to the truncated
maximum-likelihood estimates and not to estimates permitting infinite
values.

Equivalent tests. The equivalent-tests method assumes that the
item parameter distributions of the tests being linked are equivalent.
Linking, under this assumption, is accomplished by setting the means
of the a and b parameters to common values in each of the tests.
Practically, these values can be any values desired. To aid in
interpretation of the fidelity and asymptotic characteristic
statistics, these common values were set to the true means obtained in
the simulation reported in the design section of this report, 1.586
and 0.227 for a and b, respectively. This was accomplished by
computing transformation parameters k and m as follows:

    k = 1.586 / \mu_d                                        [28]

    m = (\mu_d \mu_e - 0.360) / \mu_d                        [29]

where \mu_d and \mu_e are the means of the a and b parameter estimates
in each test prior to linking.
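
The following Python sketch illustrates Equations 28 and 29 together
with the transformation of Equations 14 and 15. The function name and
the parameter values are hypothetical; the constant 0.360 in Equation
29 is taken here to be the product of the two target means (1.586 x
0.227), which is an inference from the arithmetic rather than a
statement in the report.

```python
from statistics import mean

TARGET_MEAN_A = 1.586  # true mean of a reported for the simulation
TARGET_MEAN_B = 0.227  # true mean of b reported for the simulation


def equivalent_tests_constants(d_values, e_values):
    """Return (k, m) so the linked a and b means equal the target means."""
    mu_d = mean(d_values)
    mu_e = mean(e_values)
    k = TARGET_MEAN_A / mu_d
    # TARGET_MEAN_A * TARGET_MEAN_B is about 0.360.
    m = (mu_d * mu_e - TARGET_MEAN_A * TARGET_MEAN_B) / mu_d
    return k, m


# Hypothetical unlinked estimates for one test.
d_values = [1.2, 1.8, 2.1]
e_values = [-0.5, 0.1, 0.9]

k, m = equivalent_tests_constants(d_values, e_values)
a_linked = [d * k for d in d_values]
b_linked = [(e - m) / k for e in e_values]
```

Only the means are matched; the spread of the linked a and b values
still depends on the estimates in each test.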



Results 



The magnitude of the amount of data generated by this project
made it unreasonable to present all analyses in the body of this
report. To meaningfully present the analyses done, individual tables
are presented in the Technical Appendix and summary tables are
presented here in the text. For the homogeneous linking evaluation in
which linking was done separately in each of the 12 cells, 12
individual tables are presented for each of the three classes of
analyses in the Technical Appendix. One composite table is presented
in the body of the report for each class of analysis. For the
heterogeneous linking evaluation where five replications pooling 20
items from each cell were done, five individual tables for each class
of analysis are presented in the Technical Appendix, and one is
presented in the body of the report.

Fidelity of parameter estimation. Table 29 presents
fidelity-of-parameter-estimation statistics for eight linking methods
in the homogeneous condition. The first six methods correspond to
different methods of determining the linking constants within the
equivalent-groups method. The seventh is the equivalent-tests linking
method. The "no-linking" method is included as a baseline of
comparison in which


Table 29. Item Parameter Error—Equivalence Methods
Homogeneous Condition Using Randomly Sampled Examinees

                        True            Bias in       Absolute    RMS
Method               Mean     SD      Mean     SD      Error     Error      r

Bayesian
  a                 1.591    .482    -.020    .018     .344      .469     .591
  b                  .221   1.329     .088    .311     .293      .425     .987

Progressed Bayes
  a                 1.591    .482     .041    .036     .359      .484     .581
  b                  .221   1.329     .072    .250     .255      .370     .987

Max. Likelihood
  a                 1.591    .482     .344    .125     .527      .693     .576
  b                  .221   1.329     .023    .019     .171      .234     .987

Regressed M.L.
  a                 1.591    .482     .223    .088     .454      .605     .576
  b                  .221   1.329     .036    .106     .190      .264     .987

Robust M.L.
  a                 1.591    .482     .263    .112     .473      .616     .578
  b                  .221   1.329     .043    .076     .186      .271     .986

Rob. Reg. M.L.
  a                 1.591    .482     .202    .093     .435      .572     .579
  b                  .221   1.329     .048    .121     .198      .295     .986

Equivalent Tests
  a                 1.591    .482    -.006    .015     .337      .456     .577
  b                  .221   1.329     .005    .275     .358      .487     .974

No Linking
  a                 1.591    .482     .236    .091     .453      .596     .581
  b                  .221   1.329     .110    .087     .198      .268     .987
the parameters were taken directly from OGIVIA with no explicit
transformation. In fact, this procedure approximates an
equivalent-groups linking method because OGIVIA, in an early stage of
calibration, sets its best estimates of the mean and variance of
ability to zero and one.

The first column presents the means of the true a and b
parameters for all cells in the data set. To compute the values in the
first column, means of parameters for all items in a cell were
computed for that cell. This included all items in the five
calibration groups. The mean of these 12 cell means was then computed
for the entry in Table 29. The means of the a and b parameters, 1.591
and .221, were quite close to the means obtained in independent
simulation (discussed with the analysis of the basic data sets) of
1.586 and .227.

The standard deviations presented in column two were computed as
the square root of the mean variance averaged in the same manner as
the means of column one. The averages of 0.482 and 1.329 were, again,
very close to those obtained in simulation, 0.488 and 1.338.

Biases presented in columns three and four were computed as the
linked value minus the true value for both means and standard
deviations. Mean biases were computed for items in each of the 12
cells. Table 29 presents the means of these 12 cell means.

Absolute error was computed for each cell as the mean of the
absolute deviations of linked from true item parameters for all items
in a cell. Table 29 presents the simple average of these means over
all 12 cells.

Root-mean-square error was calculated for each cell in a manner
similar to that of absolute error. The squared deviations were
averaged (rather than the absolute deviations), and the square root of
the resultant mean was taken. The RMS error presented in Table 29 is
the square root of the mean of the squared individual cell values.

Correlations between true and estimated parameters were computed
in each of the 12 cells. An r-to-z average of the cell values was
then taken for each entry in Table 29.
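
The three pooling rules described above can be sketched in a few lines
of Python. The function names are hypothetical, and the r-to-z average
is taken to be the usual Fisher transformation (inverse hyperbolic
tangent), which the report does not spell out. The usage lines apply
the rules to the 500-examinee rows of Tables 20 and 21, which appear to
have been pooled in the same way.

```python
import math


def pooled_mean(cell_values):
    """Simple average of cell values (used for means, biases, absolute errors)."""
    return sum(cell_values) / len(cell_values)


def pooled_rms(cell_values):
    """Square root of the mean of the squared cell values (SDs and RMS errors)."""
    return math.sqrt(sum(v * v for v in cell_values) / len(cell_values))


def pooled_correlation(cell_correlations):
    """r-to-z average: transform each r to z, average, and transform back."""
    zs = [math.atanh(r) for r in cell_correlations]
    return math.tanh(sum(zs) / len(zs))


# 500-examinee rows of Tables 20 and 21 (test lengths 20, 35, 50, 65).
abs_row = pooled_mean([0.320, 0.336, 0.227, 0.266])
rms_row = pooled_rms([0.366, 0.434, 0.303, 0.330])

# Hypothetical correlations, for illustration only.
r_avg = pooled_correlation([0.9, 0.5])
```

Rounded to three places, the two row values come out as .287 and .362,
matching the row averages printed in those tables. The r-to-z average
of .9 and .5 is larger than their arithmetic mean of .7, which is the
usual behavior of this transformation.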

Compared in terms of bias, the equivalent-tests method of linking
produced estimates closest in mean a and mean b. It also produced
estimates with the least bias in standard deviation of a. Several
methods had superior estimates in terms of standard deviations of the
b parameters, however.

The equivalent-tests method was again superior when absolute
error in the a parameters of the various methods was considered.
Equivalent-groups methods using either of the Bayesian procedures
were nearly as good. When b parameters were considered, the
maximum-likelihood procedures appeared to produce less absolute error
than the other methods.

Root-mean-square error comparisons produced the same findings:
the equivalent-tests method was superior in estimation of the a
parameters, with the Bayesian equivalent-groups methods close behind.
The maximum-likelihood equivalent-groups methods produced the best
estimates of the b parameters.

Correlational analyses showed the Bayesian and no-linking
procedures to produce the best-linked a parameters. The
maximum-likelihood procedures did nearly as well. The equivalent-tests
method produced a-parameter correlations about as high as those of the
maximum-likelihood methods. The b-parameter correlations were nearly
constant at .986 to .987 for all but the equivalent-tests method, which
produced a correlation of only .974.

Table 30 presents fidelity statistics for the heterogeneous
linking condition containing pooled results of five replications
sampling 20 items from each cell. Again, all entries are summary
statistics of several individual tables contained in the Technical
Appendix. In this case each entry represents pooled results of five
replications rather than of 12 cells. The columns of the table all
correspond to those of Table 29, and the pooling, in each case, was
done in the same manner.

The means and standard deviations presented in the first two
columns were again close to the true values found in the independent
simulation. That they were slightly different is due to the fact that
only the first 20 items in each calibration group were used for the
heterogeneous analysis. Thus, less than half of the items included in
the homogeneous analysis were used in this analysis.

The bias data in columns three and four presented essentially the
same picture as the bias data in Table 29. Similarly, identical
observations could be made regarding the absolute and root-mean-square
error data of columns five and six. This similarity is more an
artifact than a discovery, however, as neither the biases nor the
errors are affected by composition of the item sets. The fact that
they differ at all is due to fluctuations caused by item sampling.

The change in composition was expected to affect the correlations.
Different test lengths and calibration group sizes do produce different
biases in linking constants. The different biases shift items of the
different cells differentially and this affects the correlations among
the parameters. Marked changes from Table 29 occurred in Table 30.
Where Table 29 showed a-parameter correlations closely clustered in
value, the a-parameter correlations presented in Table 30 had a
relatively wide range of values. Furthermore, the equivalent-tests
method, which produced the lowest correlation in Table 29, produced
the highest
Table 30. Item Parameter Error—Equivalence Methods
Heterogeneous Condition Using Randomly Sampled Examinees

                        True            Bias in       Absolute    RMS
Method               Mean     SD      Mean     SD      Error     Error      r

Bayesian
  a                 1.588    .490    -.014    .038     .348      .470     .580
  b                  .248   1.350     .090    .315     .295      .431     .983

Progressed Bayes
  a                 1.588    .490     .047    .060     .363      .487     .577
  b                  .248   1.350     .073    .253     .258      .375     .984

Max. Likelihood
  a                 1.588    .490     .350    .202     .532      .698     .529
  b                  .248   1.350     .020    .018     .175      .240     .985

Regressed M.L.
  a                 1.588    .490     .229    .152     .459      .610     .535
  b                  .248   1.350     .035    .107     .194      .271     .985

Robust M.L.
  a                 1.588    .490     .270    .157     .478      .622     .548
  b                  .248   1.350     .042    .078     .191      .279     .983

Rob. Reg. M.L.
  a                 1.588    .490     .209    .130     .441      .577     .557
  b                  .248   1.350     .047    .125     .204      .303     .983

Equivalent Tests
  a                 1.588    .490     .001    .032     .340      .459     .596
  b                  .248   1.350     .008    .277     .361      .491     .964

No Linking
  a                 1.588    .490     .242    .144     .458      .600     .553
  b                  .248   1.350     .108    .086     .200      .273     .986



in Table 30. With the exception of this method, the a-parameter
correlations were lower in Table 30 than in Table 29. The b-parameter
correlations lost some of the uniformity they exhibited in Table 29 but
the same general conclusions could be drawn. The equivalent-tests
method was still inferior in terms of b-parameter correlations.

Characteristics of asymptotic ability estimates. Table 31 presents
statistics descriptive of linking and calibration errors on
asymptotic estimates of ability in the homogeneous condition. The



Table 31. Asymptotic Ability Estimates—Equivalence Methods
Homogeneous Condition Using Randomly Sampled Examinees

                                      Absolute      RMS
Method               Mean      SD       Error      Error        R

Bayesian             .004    1.073      .064        .098      .999
Progressed Bayes     .001    1.035      .043        .072      .999
Max. Likelihood     -.002     .890      .100        .140      .998
Regressed M.L.      -.005     .945   [illegible]    .100   [illegible]
Robust M.L.          .002     .915      .079        .111      .999
Rob. Reg. M.L.      -.003     .944      .061        .088      .999
Equivalent Tests    -.086    1.066      .151        .209      .998
No Linking           .074     .934      .100        .125      .999



values in the table were compiled from corresponding values in 12
cells. The means and absolute errors in Table 31 represent simple
averages of the cell values. The standard deviations and root-mean- 
square errors were computed as the square root of the mean squared 
values from the individual tables. The correlations were computed as 
the r-to-z average of the individual correlations. 

The means, presented in the first column, were all fairly close 
to the true value of zero. The means produced by the six equivalent 
groups methods were all somewhat closer than the means produced by 
the equivalent-tests method or by no linking. The standard devia- 
tions were near the true value of 1.0 but were, typically, not as 
close as the means had been. The most deviant was the maximum-like- 
lihood equivalent-groups procedure. The least deviant was the pro- 
gressed-Bayesian equivalent-groups procedure. 

Columns three and four present absolute and root-mean-square
errors of the asymptotic estimates. The eight linking procedures
ranked essentially the same in the two columns; the absolute errors
produced a tie and the root-mean-square errors did not. The
progressed-Bayesian equivalent-groups procedure produced the least
error. The equivalent-tests procedure produced the most, more than
the no-linking condition. Except for the equivalent-tests method,
all methods (including no-linking) produced lower errors in asymptotic
estimates than were produced by the unlinked individual calibrations
summarized in Tables 13 and 14. Average values in those tables for
absolute and root-mean-square error, respectively, were .113 and .153.
The observation that error in the no-linking condition decreased was
apparently due to a better averaging of parameter errors when all
five calibration groups within a cell were combined.

The correlations between true and asymptotic ability estimates
were so high as to be uninformative about linking adequacy of the
various methods. All were within .002 of unity and, although the
maximum-likelihood equivalent-groups and the equivalent-tests methods
were slightly inferior, this difference may have been due to
accentuation of trivial differences incurred in rounding.

Table 32 presents asymptotic error statistics for the
heterogeneous condition. Again, all values are summary values and were
prepared in the same manner as Table 31, from five replications, each
of which sampled 20 items from each of the 12 cells. The first two
columns, those of the mean and standard deviation, were essentially 
unchanged from Table 31. The only difference was a slight tendency 
toward more extreme deviations of the standard deviations from 1.0. 
The two Bayesian methods were exceptions to this, in that they were 
slightly less deviant than in the homogeneous condition. 

Table 32. Asymptotic Ability Estimates—Equivalence Methods
Heterogeneous Condition Using Randomly Sampled Examinees

                                      Absolute      RMS
Method               Mean      SD       Error      Error        R

Bayesian             .006    1.064      .059        .084      .999
Progressed Bayes     .003    1.025      .037        .059      .999
Max. Likelihood      .002     .870      .108        .139      .999
Regressed M.L.      -.001     .927      .064        .089      .999
Robust M.L.          .004     .904      .081        .110      .999
Rob. Reg. M.L.      -.000     .933      .059        .085      .999
Equivalent Tests    -.087    1.075      .100        .143      .998
No Linking           .076     .919      .100        .123      .999









The absolute and root-mean-square errors showed some changes 
from the preceding table. The ordering of methods by the two 
statistics was not identical in Table 32. The Bayesian methods were still 
superior to all other methods. The equivalent-groups method improved 
to a point where it was nearly as good as no linking and, depending 
on the type of error, slightly better or slightly worse than the 
maximum-likelihood method. 

The correlations presented in the fifth column were, again, 
particularly uninformative. Only one, that corresponding to the 
equivalent-tests method, showed any departure from the nearly perfect .999. 

Efficiency of ability estimation. Table 33 presents efficiency 
data for the homogeneous linking condition. The first column 
contains the average item information produced in several ways. The 
first entry indicates the information available in the average item 
using true parameters. The second entry indicates information avail- 
able using estimated parameters and (hypothetical) perfect linking. 
The remaining entries in the first column indicate information avail- 
able from items using parameters linked in various ways. 



Table 33. Efficiency Analysis--Equivalence Methods 
Homogeneous Condition Using Randomly Sampled Examinees 

                      Average        Efficiency Relative to 
                      Item           True          Estimated 
Method                Information    Parameters    Parameters 

True Parameters       .319 
Est. Parameters       .287           .898 
Bayesian              .284           .893          .989 
Progressed Bayes      .284           .988          .988 
Max. Likelihood       .284           .999          .989 
Regressed M.L.        .284           .998          .989 
Robust M.L.           .284           .993          .989 
Rob. Reg. M.L.        .284           .899          .988 
Equivalent Tests      .276           .864          .962 
No Linking            .284           .397          .989 

Information from the true parameters was calculated separately 
in each of the individual calibration groups in each of the 12 cells 
using true parameters. The individual information values were then 
averaged to produce the value, .319, in Table 33. The information 
from the estimated parameters (the second entry) was obtained in the 
same way except that estimated parameters rather than true parameters 
were used. Since the computations were done within individual cali- 
bration groups, linking had no effect on the values. 

The remaining values in the first column were obtained by pool- 
ing all items in each cell after the linking transformations were 
applied. The essential difference between these values and the in- 
formation from the estimated parameters (i.e., the second entry) was 
that these values were obtained from a pool of all items in each cell 
rather than from each calibration group individually. The entries 
presented in Table 33 are simple averages of the corresponding en- 
tries in the 12 individual cell tables. 

Efficiency relative to true parameters shown in column two was 
calculated directly from the values in column one of the table. Each 
value presented in column two is the corresponding value in column 
one divided by .319. Efficiency relative to estimated parameters 
was calculated similarly except that column one values were divided 
by .287. All columns in Table 33 present essentially the same data 
from a different viewpoint. 
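
The column arithmetic described above can be checked with a short sketch. The function and variable names are illustrative assumptions and are not part of the original analysis; the only inputs taken from the text are the two divisors, .319 and .287, and the equivalent-tests information value of .276. Because the tabled information values are rounded to three places, ratios recomputed this way may differ from the tabled efficiencies in the third decimal.

```python
# Divisors quoted in the text for Table 33.
TRUE_INFO = 0.319       # average item information, true parameters
ESTIMATED_INFO = 0.287  # average item information, estimated parameters


def efficiencies(item_info):
    """Return (efficiency vs. true, efficiency vs. estimated parameters)."""
    return item_info / TRUE_INFO, item_info / ESTIMATED_INFO


# Equivalent-tests entry of Table 33 (average item information .276).
rel_true, rel_est = efficiencies(0.276)
print(f"relative to true: {rel_true:.3f}, relative to estimated: {rel_est:.3f}")
```

The second ratio is the one the text treats as the index of linking efficiency, since its denominator already contains the calibration error.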

The efficiencies relative to estimated parameters provide data 
most directly relevant to comparisons of linking methods. These values 
can be interpreted as an index of linking efficiency. The information 
available from the estimated parameters calculated within individual 
calibration groups represents efficiency of calibration free of linking 
errors. Any degradation from that point, as items from several cali- 
bration groups are pooled, represents errors due to linking. 

The efficiencies relative to estimated parameters suggest that 
there is very little difference among most linking methods in this 
condition. The notable exception is the equivalent-tests method. 
Where all other linking methods, including no-linking, had efficien- 
cies of .988 or .989, the equivalent-tests method had a linking effi- 
ciency of only .962. 

Table 34 presents efficiency statistics for the heterogeneous 
linking condition. All statistics were calculated in essentially 
the same manner as before. The primary difference was that the en- 
tries were computed as the average of five replication averages rather 
than as the average of 12 cell averages. 

The information values for the true and estimated parameters 
changed very little from those of Table 33. The slight changes were 
due to the fact that only about half of the items on which Table 33 
was based were used in computing the statistics of Table 34. 

Table 34. Efficiency Analysis--Equivalence Methods 
Heterogeneous Condition Using Randomly Sampled Examinees 

                      Average        Efficiency Relative to 
                      Item           True          Estimated 
Method                Information    Parameters    Parameters 

True Parameters       .317 
Est. Parameters       .285           .901 
Bayesian              .278           .876          .973 
Progressed Bayes      .277           .876          .972 
Max. Likelihood       .273           .861          .955 
Regressed M.L.        .273           .863          .958 
Robust M.L.           .276           .870          .965 
Rob. Reg. M.L.        .276           .872          .967 
Equivalent Tests      .269           .850          .944 
No Linking            .271           .865          .960 

Marked changes in linking efficiency were noted, however. All 
methods, without exception, were less efficient in the heterogeneous 
condition. Differences among the methods were also more obvious. 
The two Bayesian methods were the most efficient. The robust maximum- 
likelihood procedures were next, followed by the no-linking method 
and the maximum-likelihood procedures. The equivalent-tests method 
was again the least efficient of all. 

Table 35 presents linking efficiencies of the Bayesian 
equivalent-groups linking method for each of the 12 cells arranged by test 
length and sample size. The Bayesian procedure was singled out for 
this breakdown because it appeared, from data just presented, to be 
one of the best equivalent-groups linking procedures. Linking 
efficiency was chosen as the single statistic to be explored in this 
fashion because it seemed to best summarize the data to answer the 
question of which linking method allowed the best ability estimation. 
Individual cell entries in Table 35 were computed by taking the ratio 
of the information values of the linked parameters to the information 
values of the estimated parameters calculated within individual 
calibrations. The marginal values presented are simple averages of the 
corresponding row and column values. They are not pooled values as 
were those in Tables 33 and 34, which were computed as ratios of 
averaged information values rather than averages of efficiencies. 

Table 35. Cellwise Efficiency Analysis 
Bayesian Score--Randomly Sampled Examinees 

Sample               Item Set Size 
Size         20      35      50      65      Average 

 500        .968    .991    .991    .959     .977 
1000        .984    .990    .993    .996     .991 
2000        .972    .993    .992    .996     .988 

Average     .975    .991    .992    .984 
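
The difference between the two kinds of summary can be sketched as follows. The cell efficiencies are those of Table 35; the second pair of lists holds hypothetical information values, invented only to show that an average of ratios and a ratio of averages need not agree. All names are illustrative assumptions.

```python
from statistics import mean

# Cell efficiencies from Table 35 (rows: sample size; columns: item set size
# 20, 35, 50, 65).
cells = {
    500: [0.968, 0.991, 0.991, 0.959],
    1000: [0.984, 0.990, 0.993, 0.996],
    2000: [0.972, 0.993, 0.992, 0.996],
}

# Marginal values as described in the text: simple averages of the cells.
row_averages = {n: mean(row) for n, row in cells.items()}
column_averages = [mean(col) for col in zip(*cells.values())]

# Hypothetical information values, used only to contrast the two summaries.
linked = [0.20, 0.36]
estimated = [0.25, 0.37]
average_of_ratios = mean(l / e for l, e in zip(linked, estimated))
ratio_of_averages = mean(linked) / mean(estimated)

print({n: round(v, 3) for n, v in row_averages.items()})
print([round(v, 3) for v in column_averages])
print(round(average_of_ratios, 3), round(ratio_of_averages, 3))
```

The pooled form used for Tables 33 and 34 weights cells by their information, whereas the cellwise marginals weight every cell equally.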

No obvious relationships between linking efficiency and either 
test length or calibration sample size were observed. No trends were 
apparent, even in the marginal values. No interactions were appar- 
ent in the individual cell averages. 

Table 36 presents a similar breakdown of the equivalent-tests 
method efficiencies. The marginal averages exhibited a definite 
increasing trend with increasing test length. This trend was not 
particularly consistent in the individual cell values, however. The 
trend was apparent at sample sizes of 2,000 but not at 500 or 1,000. 
No relationship between efficiency and sample size was apparent in 
Table 36. 

Table 36. Cellwise Efficiency Analysis 
Equivalent Tests--Randomly Sampled Examinees 

Sample               Test Length 
Size         20      35      50      65      Average 

 500        .916    .985    .973    .966     .960 
1000        .972    .930    .973    .986     .965 
2000        .928    .961    .961    .982     .958 

Average     .939    .959    .969    .978 

Discussion 

Three sets of analyses have been presented. The fidelity analy- 
ses provided no conclusive evidence regarding which linking proced- 
ure was most effective. Data relevant to this were weak and con- 
flicting. Methods most effective in linking a parameters were not 
the ones most effective in linking b parameters. There was no way to 
determine in any practical way whether a or b errors were more del- 
eterious in regard to ability estimation. 

The asymptotic estimation analysis was somewhat more helpful in 
that the joint effect of parameter errors on ability estimation could 
be observed. These data suggested that the two Bayesian linking 
procedures and the robust-regressed maximum-likelihood procedures were 
somewhat more effective than the others and that the equivalent-tests 
method was typically no better than the no-linking method. 

Efficiency analyses suggested that whatever differences there 
were among the methods, they were quite small. Efficiency loss due 
to linking error was always less than loss due to calibration error, 
considerably less in some cases. In the worst case of linking error, 
information lost to linking was half as great as that lost to 
calibration. For the best linking methods, information loss due to 
linking was 10% to 20% as large as that due to calibration, depending on 
the conditions. 
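
The comparison of linking loss with calibration loss can be reproduced from the average information values in Tables 33 and 34. The sketch below assumes that loss is measured as a simple difference in average item information; the function name is illustrative and does not come from the report.

```python
def loss_ratio(true_info, estimated_info, linked_info):
    """Information lost to linking as a fraction of that lost to calibration."""
    calibration_loss = true_info - estimated_info
    linking_loss = estimated_info - linked_info
    return linking_loss / calibration_loss


# Bayesian method, homogeneous condition (Table 33).
homogeneous_bayes = loss_ratio(0.319, 0.287, 0.284)
# Bayesian method, heterogeneous condition (Table 34).
heterogeneous_bayes = loss_ratio(0.317, 0.285, 0.278)
# Equivalent-tests method, heterogeneous condition (Table 34).
heterogeneous_equivalent_tests = loss_ratio(0.317, 0.285, 0.269)

for label, value in [
    ("Bayesian, homogeneous", homogeneous_bayes),
    ("Bayesian, heterogeneous", heterogeneous_bayes),
    ("Equivalent tests, heterogeneous", heterogeneous_equivalent_tests),
]:
    print(f"{label}: {value:.2f}")
```

Under that assumption the equivalent-tests method in the heterogeneous condition gives the ratio of one half cited as the worst case.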



Conclusions 



Two general linking methods, the equivalent-groups and the 
equivalent-tests methods, were evaluated and compared to each other and 
to a no-linking control method. These comparisons were done in both 
a homogeneous linking condition, where the items linked were calib- 
rated in tests of the same length using examinee samples of equal 
size, and in a heterogeneous condition of mixed test lengths and 
sample sizes. Several conclusions can be drawn from these data. 

First, the equivalent-groups methods were generally superior to 
the equivalent-tests method. In some analyses, reported in the fi- 
delity of estimation section, the equivalent-tests method appeared to 
be superior. In the more readily interpretable asymptotic-estimate 
and efficiency analyses, the equivalent-tests method was consistently 
one of the poorer linking procedures. 

Second, of the six equivalent-groups procedures evaluated, the 
ones based on the Bayesian scores appeared to be slightly superior to 
the others. This superiority was apparent only in the heterogeneous 
linking condition, however. In this condition a slight superiority 
was observed in the asymptotic estimation and efficiency analyses. 
Little difference among equivalent-groups procedures was observed 
in the homogeneous condition although the Bayesian methods had 
slightly less error in the asymptotic estimates than did some of the 
other procedures. 

Third, it should be noted that the no-linking method worked 
reasonably well in these analyses. Although the other procedures 
produced slightly more efficient linking, relatively little effic- 
iency would be lost, under the sampling characteristics present here, 
if the parameters were used as produced by OGIVIA with no explicit 
linking done. 

Finally, although definite relationships between calibration 
efficiency and test length and sample size were shown in a previous 
section, no such relationships were found with respect to linking 
efficiency. This is counter-intuitive because all equivalence methods 
are dependent on sampling error which is dependent on sample size. 
Lack of any relationships may have been due to the fact that the range 
of sample sizes was too small to produce them. To the extent that 
this range covers the range of interest, however, the conclusion of no 
differences can reasonably be applied. 



V. LINKING WHEN EXAMINEES ARE SYSTEMATICALLY SAMPLED 



Linking with examinees systematically sampled represented an ex- 
treme case of violation of the assumption of random sampling essen- 
tial to the equivalent-groups linking method. Only the equivalent- 
tests and the anchor methods were theoretically appropriate for this 
environment. Research reported in the previous section had shown the 
equivalent-groups method to be superior to the equivalent-tests method 
when the random-sampling assumption was satisfied. Thus, although it 
was not theoretically appropriate for this environment, the equivalent- 
groups method was evaluated to determine if it was practically accept- 
able. 

The basic data set containing systematically sampled examinees 
was used for this portion of the research. For each calibration, an 
AFEES group was selected at random from the 65 available, and 
examinees were selected from that group. These data were then used in a 
manner similar to the data of the randomly sampled examinees. 



Equivalence Methods 



Procedure 

The data used in this portion of the research differed from those 
reported in the previous section. The linking procedures used to im- 
plement the equivalent-groups and equivalent-tests methods did not dif- 
fer, however. All six methods used for determining linking constants 
for the equivalent-groups method were again evaluated. The same link- 
ing transformation equations were again applied to both the equivalent- 
groups and the equivalent-tests methods. 

Results 

Fidelity of parameter estimation . Fidelity-of-estimation sta- 
tistics for the homogeneous condition with systematically sampled ex- 
aminees are presented in Table 37. True means and standard devia- 
tions, shown in the first two columns, were close to the population 
values. The mean of the b parameter, .262, was somewhat more deviant 
from the population value of .227 than the value observed in the pre- 
vious section. All four values appeared to be well within the limits 
of sampling variation, however. 

Bias in the estimated parameters is described in columns three 
and four. The Bayesian equivalent-groups methods tended to 
underestimate the a parameters. The maximum-likelihood procedures and 
the robust-maximum-likelihood procedures tended to overestimate the 
a parameters, although this was less the case with the non-robust 
regressed procedure. The equivalent-tests procedure produced little 
bias in the a parameters. No-linking resulted in overestimation of 
a parameters. Slight bias in the b-parameter means was produced by 
the two Bayesian procedures. The no-linking procedure produced a 
similar amount of bias. The other procedures all produced somewhat 
less bias. 

Table 37. Item Parameter Error--Equivalence Methods 
Homogeneous Condition Using Systematically Sampled Examinees 

                         True           Bias in       Absolute   RMS 
Method               Mean     SD     Mean     SD     Error      Error     R 

Bayesian 
  a                  1.588    .501   -.159   -.012   .374       .519     .533 
  b                   .262   1.344    .173    .572   .568       .759     .971 
Progressed Bayes 
  a                  1.588    .501   -.099    .008   .376       .517     .533 
  b                   .262   1.344    .147    .495   .512       .682     .971 
Max. Likelihood 
  a                  1.588    .501    .212    .111   .499       .674     .531 
  b                   .262   1.344    .046    .188   .333       .423     .970 
Regressed M.L. 
  a                  1.588    .501    .088    .073   .439       .596     .530 
  b                   .262   1.344    .077    .295   .388       .493     .971 
Robust M.L. 
  a                  1.588    .501    .194    .106   .470       .623     .529 
  b                   .262   1.344    .054    .191   .334       .431     .970 
Rob. Reg. M.L. 
  a                  1.588    .501    .107    .077   .425       .566     .531 
  b                   .262   1.344    .079    .269   .375       .489     .971 
Equivalent Tests 
  a                  1.588    .501   -.003    .034   .371       .510     .526 
  b                   .262   1.344   -.035    .340   .417       .587     .971 
No Linking 
  a                  1.588    .501    .139    .084   .450       .602     .533 
  b                   .262   1.344    .130    .237   .364       .464     .971 

In terms of bias in parameter standard deviations, the Bayesian 
procedures produced the least bias for the a parameters. The 
maximum-likelihood procedures and the no-linking procedure produced the 
most bias in the a-parameter standard deviations. These observations 
essentially reversed when the b-parameter bias was considered; the 
Bayesian procedures produced the greatest bias, and the 
maximum-likelihood and no-linking procedures produced the least. 

When the biases in columns three and four of Table 37 are 
compared to corresponding values for the randomly sampled examinees 
presented in Table 29, several things may be noted: The tendency of 
the maximum-likelihood and no-linking procedures to overestimate the 
a parameters was observed in both tables; biases in b-parameter means 
and a-parameter standard deviations were similar in both tables; and 
the biases in the b-parameter standard deviations were somewhat 
larger in Table 37. 

Absolute and root-mean-square errors of parameter estimation are 
presented in columns five and six of Table 37. The equivalent-tests 
method produced the least parameter error, evaluated by either 
statistic, for the a parameters. The two Bayesian methods were nearly 
as good, however. The maximum-likelihood and no-linking procedures 
produced the greatest amount of a-parameter error. The least 
b-parameter error was produced by the maximum-likelihood methods; the most 
was produced by the Bayesian methods. 
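
For readers who want to recompute statistics of this kind, the following is a minimal sketch of how bias, absolute error, and root-mean-square error might be obtained from paired true and estimated parameters. The definitions used here (mean signed error, mean absolute error, and square root of the mean squared error) are assumptions based on the column labels, and the parameter values are hypothetical, not data from this study.

```python
from math import sqrt


def error_statistics(true_values, estimates):
    """Return (bias in mean, absolute error, RMS error) for paired values."""
    n = len(true_values)
    errors = [est - true for true, est in zip(true_values, estimates)]
    bias = sum(errors) / n
    absolute_error = sum(abs(d) for d in errors) / n
    rms_error = sqrt(sum(d * d for d in errors) / n)
    return bias, absolute_error, rms_error


# Hypothetical b parameters for four items (not taken from the report).
true_b = [-1.0, 0.0, 0.5, 1.5]
estimated_b = [-1.4, 0.1, 0.9, 1.3]

bias, absolute_error, rms_error = error_statistics(true_b, estimated_b)
print(f"bias={bias:.3f} absolute={absolute_error:.3f} rms={rms_error:.3f}")
```

Under these definitions the RMS error can never be smaller than the absolute error, which is a useful consistency check when reading the tabled values.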

Error in the a parameters observed in Table 37 was similar in 
magnitude to that observed in Table 29. Absolute errors of the a 
parameters ranged from .337 to .327 in Table 29; in Table 37 the 
comparable range was from .371 to .499. Error in the b parameters 
was somewhat greater in Table 37, however. Absolute errors of the b 
parameters ranged from .171 to .358 in Table 29; in Table 37 they 
ranged from .333 to .568. 

Correlations between true and estimated a parameters, shown in 
column seven, were very similar for all linking methods. The 
Bayesian, the robust-regressed maximum-likelihood, and the no-linking 
procedures were best, with correlations of .533. The 
equivalent-tests method was worst, with a correlation of .526. Correlations 
for the b parameters were almost uniformly .971. The exception was 
the maximum-likelihood procedure, with a correlation of .970, a 
trivial difference. 

Compared to correlations in Table 29, these correlations were 
somewhat lower. It is difficult to say whether this was due to cali- 
bration or to linking errors. Both a- and b-parameter correlations 
were lower in analysis of the current basic data set, however, so the 
drop was probably due to greater calibration error. 

Table 38 presents fidelity-of-calibration data for the 
heterogeneous condition. Means and standard deviations of item parameters, 
shown in columns one and two, were essentially the same as for the 
homogeneous condition. Differences were due to the fact that less than 
half of the items used in the homogeneous condition were used here. 

Parameter bias statistics, shown in columns three and four, 
were essentially unchanged from the homogeneous condition. Changes 
in biases of the a-parameter means were in the third decimal place. 
Changes in the biases of the b-parameter means were in the second 
decimal place. Changes in the bias of the a- and b-parameter 
standard deviations were somewhat greater, but almost all were in the 
second decimal place. 

Table 38. Item Parameter Error--Equivalence Methods 
Heterogeneous Condition Using Systematically Sampled Examinees 

                         True           Bias in       Absolute   RMS 
Method               Mean     SD     Mean     SD     Error      Error     R 

Bayesian 
  a                  1.586    .500   -.159   -.005   .377       .521     .511 
  b                   .281   1.371    .191    .593   .576       .766     .966 
Progressed Bayes 
  a                  1.586    .500   -.100    .018   .379       .519     .507 
  b                   .281   1.371    .166    .512   .519       .688     .967 
Max. Likelihood 
  a                  1.586    .500    .210    .186   .505       .676     .457 
  b                   .281   1.371    .062    .197   .335       .473     .970 
Regressed M.L. 
  a                  1.586    .500    .087    .122   .441       .598     .469 
  b                   .281   1.371    .095    .305   .392       .496     .971 
Robust M.L. 
  a                  1.586    .500    .192    .138   .473       .622     .491 
  b                   .281   1.371    .068    .198   .331       .427     .970 
Rob. Reg. M.L. 
  a                  1.586    .500    .106    .095   .423       .567     .505 
  b                   .281   1.371    .091    .280   .376       .488     .968 
Equivalent Tests 
  a                  1.586    .500   -.005    .029   .370       .50?     .526 
  b                   .281   1.371   -.016    .361   .421       .589     .955 
No Linking 
  a                  1.586    .500    .138    .127   .455       .601     .481 
  b                   .281   1.371    .116    .216   .3?8       .466     .971 

The ranges of parameter errors shown in columns five and six 
were essentially unchanged from the homogeneous condition. Similar- 
ly, the linking procedures producing the least error were unchanged; 
the equivalent-tests method produced the least error in the a 
parameters and the maximum-likelihood procedure produced the least error 
in the b parameters. 

The magnitude of the a-parameter error showed no apparent change 
from that observed in the data set containing randomly sampled exam- 
inees. The b-parameter error increased, however. These trends are 
similar to those of the homogeneous condition. 

Correlations between true and estimated parameters generally 
showed a decrease from corresponding values in the homogeneous con- 
dition. This decrease was most pronounced for the a parameters. The 
highest a-parameter correlation was produced by the equivalent-tests 
method. This was followed by the Bayesian methods. The maximum- 
likelihood and no-linking methods produced the highest b-parameter 
correlations; the equivalent-tests methods produced the lowest. 
Where differences were trivial in the homogeneous condition, correl- 
ations ranged from .956 to .971 in the heterogeneous condition. 

Characteristics of asymptotic ability estimates. Table 39 
presents asymptotic ability estimate statistics for the homogeneous case 
of linking with systematically sampled examinees. The mean 
asymptotic ability was close to zero for most methods, but more different 
from zero than was observed with the randomly sampled examinees. The 
no-linking procedure produced estimates whose means were closest to 
zero; the equivalent-tests method produced estimates whose mean was 
farthest from zero. The regressed-maximum-likelihood procedure pro- 
duced asymptotic estimates whose standard deviation was closest to 
1.0; the Bayesian procedures produced estimates with the greatest 
bias in the standard deviation. 

Absolute and root-mean-square errors are presented in columns 
three and four in Table 39. The smallest amount of error was produced 
by the regressed and the robust-regressed maximum-likelihood proce- 
dures; the largest error was produced by the equivalent-tests proce- 
dure. The remaining maximum-likelihood and the no-linking procedures 
produced errors slightly greater than the regressed and robust-regressed 
procedures. The Bayesian procedures produced error in an amount nearly 
midway between the maximum-likelihood procedures and the equivalent- 
tests procedure. This ordering of procedures was somewhat different 
from that observed in the set of randomly sampled examinees. 

Table 39. Asymptotic Ability Estimates--Equivalence Methods 
Homogeneous Condition Using Systematically Sampled Examinees 

                                      Absolute   RMS 
Method                Mean     SD     Error      Error     R 

Bayesian             -.044    1.152   .167       .223     .996 
Progressed Bayes     -.049    1.108   .115       .192     .996 
Max. Likelihood      -.060     .944   .128       .176     .996 
Regressed M.L.       -.051    1.003   .121       .159     .996 
Robust M.L.          -.064     .936   .127       .171     .996 
Rob. Reg. M.L.       -.060     .973   .121       .159     .996 
Equivalent Tests     -.200    1.022   .211       .356     .996 
No Linking            .003     .970   .125       .162     .996 

The correlations between true and asymptotic ability were uni- 
formly .996. This was a slight decrease from Table 31 where they 
were almost all .999. 

Asymptotic estimate statistics for the heterogeneous condition 
are presented in Table 40. Slight changes from Table 39 appeared in 
the means, but the no-linking method still produced the least bias and 
the equivalent-tests method produced the most. Slight changes also 
occurred in the standard deviations but none were of any consequence. 

In the heterogeneous condition, the no-linking procedure produced 
the least absolute and root-mean-square errors of the parameter 
estimates. The maximum-likelihood procedures were typically next in 
line but the Bayesian procedures closed the gap considerably. The 
equivalent-tests procedure still produced the most error. 
Root-mean-square error was invariably less for the heterogeneous condition than 
it had been for the homogeneous condition. Absolute error typically 
exhibited the same behavior but a few increases were observed. This 
decrease was similar to the one observed in the data set containing 
randomly sampled examinees. 

The correlations between true and asymptotic ability ranged from 
.995 to .996. These were too close in value to make any meaningful 
contrast between methods. The decrease from the homogeneous condition 
was extremely slight. 

Table 40. Asymptotic Ability Estimates--Equivalence Methods 
Heterogeneous Condition Using Systematically Sampled Examinees 

                                      Absolute   RMS 
Method                Mean     SD     Error      Error     R 

Bayesian             -.051    1.1?3   .1??       .195     .996 
Progressed Bayes     -.056    1.100   .121       .166     .996 
Max. Likelihood      -.075     .928   .130       .157     .995 
Regressed M.L.       -.066     .992   .107       .136     .996 
Robust M.L.          -.076     .930   .132       .158     .995 
Rob. Reg. M.L.       -.071     .972   .114       .142     .995 
Equivalent Tests     -.207    1.022   .216       .231     .996 
No Linking           -.013     .962   .095       .127     .995 


Efficiency of ability estimation. Table 41 presents calibration 
and linking efficiencies for the homogeneous condition with 
systematically sampled examinees. The first entry in the first column 
indicates that slightly less information was available from true 
parameters in this data set than for the randomly sampled examinees (.314 
vs. .319 units per item). Efficiency of calibration, as indicated by 
the first entry in the second column, was also slightly less (.887 
vs. .898). 

Linking efficiencies, presented in the third column (Table 41), 
were somewhat lower than those obtained with randomly sampled examinees 
(Table 33) and also somewhat more variable. In general, the 
equivalent-tests method produced the highest relative efficiency, .971. This was 
slightly higher than it produced in the random sampling environment. 
The Bayesian methods were next, both with .964. The 
maximum-likelihood methods ranged from .956 to .961. The no-linking procedure 
resulted in an efficiency of .957. By way of comparison, except for 
the equivalent-tests method, efficiencies in the random sampling 
environment were .988 to .989. 

Table 41. Efficiency Analysis--Equivalence Methods 
Homogeneous Condition Using Systematically Sampled Examinees 

                      Average        Efficiency Relative to 
                      Item           True          Estimated 
Method                Information    Parameters    Parameters 

True Parameters       .314 
Est. Parameters       .278           .887 
Bayesian              .268           .855          .964 
Progressed Bayes      .268           .855          .964 
Max. Likelihood       .26?           .850          .958 
Regressed M.L.        .267           .853          .961 
Robust M.L.           .266           .849          .956 
Rob. Reg. M.L.        .267           .851          .959 
Equivalent Tests      .270           .862          .971 
No Linking            .266           .849          .957 

Table 42 presents relative efficiencies for the heterogeneous 
condition. The calibration efficiency, .889, was essentially 
unchanged (as it should have been since any change would be due solely 
to sampling). Linking efficiencies were all lower in this condition 
with the maximum-likelihood procedure being the lowest, .904. The 
equivalent-tests procedure produced the highest efficiency, .949, but 
the Bayesian procedure was close, .942. 

All equivalent-groups and the no-linking procedures had lower 
efficiencies in the systematic sampling environment than in the 
random sampling environment. This was expected since a theoretically 
crucial assumption was violated. The equivalent-tests method lost no 
efficiency, as should also have been expected since no assumption 
violations occurred. 

Table 43 presents linking efficiency of the Bayesian equivalent- 
groups method as a function of test length and sample size. Effi- 
ciencies appeared to increase with increasing sample size, but this 
trend was not smooth and was somewhat inconsistent when the 12 cell 
entries were compared. No trend with test length was obvious. Again, 
essentially no trends were observed in the randomly sampled data set. 

Table 42. Efficiency Analysis--Equivalence Methods 
Heterogeneous Condition Using Systematically Sampled Examinees 

                      Average        Efficiency Relative to 
                      Item           True          Estimated 
Method                Information    Parameters    Parameters 

True Parameters       .305 
Est. Parameters       .271           .889 
Bayesian              .255           .837          .942 
Progressed Bayes      .255           .835          .940 
Max. Likelihood       .245           .804          .904 
Regressed M.L.        .249           .815          .918 
Robust M.L.           .250           .819          .922 
Rob. Reg. M.L.        .252           .828          .932 
Equivalent Tests      .257           .844          .949 
No Linking            .248           .814          .916 

Table 43. Cellwise Efficiency Analysis 
Bayesian Score--Systematically Sampled Examinees 

Sample               Item Set Size 
Size         20      35      50      65      Average 

 500        .961    .917    .954    .970     .951 
1000        .969    .939    .990    .982     .970 
2000        .966    .971    .994    .950     .970 

Average     .965    .942    .979    .967 

Table 44 presents linking efficiency of the equivalent-tests 
method as a function of test length and sample size. No trend with 
respect to sample size was obvious. Efficiency did appear to increase 
with test length in the marginal entries, although this trend was 
inconsistent in the individual rows. These findings regarding trends 
are consistent with those for the randomly sampled data set. 

Table 44. Cellwise Efficiency Analysis 
Equivalent Tests--Systematically Sampled Examinees 

Sample               Test Length 
Size         20      35      50      65      Average 

 500        .969    .907    .985    .990     .963 
1000        .977    .990    .978    .992     .984 
2000        .926    .957    .991    .986     .965 

Average     .957    .951    .985    .989 



Discussion 

Many of the data presented in this section were conflicting and 
inconsistent. Depending on which analyses were done, the different 
methods varied from best to worst. Fidelity analyses suggested that 
the equivalent-tests method was best and the maximum-likelihood 
procedure was second best. Evaluation of asymptotic ability estimates 
suggested that the equivalent-tests method produced the greatest 
asymptotic error of estimation. Efficiency analyses suggested that the 
equivalent-tests method was most efficient and the Bayesian procedures 
were almost as efficient. 



The efficiency analysis probably produces the best answers to 
questions of which procedure is best. It is the goal of linking, 
after all, to produce a set of items that will function efficiently 
together. The facts that the parameters are not "most true" or that 
the ability scale is not at arbitrarily targeted levels are secondary 
to the goal of efficiency of measurement. Efficiency analyses are 
probably most useful in selecting a procedure. 

Accepting the previous argument, several observations can be
made. First, the equivalent-tests method is the most efficient when
examinees are systematically sampled, as they were here. Second, the
Bayesian procedures are nearly as efficient with systematic sampling
and, as was observed earlier, are more efficient when examinees are
randomly sampled. At some point between the extremes in sampling
investigated here, the Bayesian procedures could be expected to become
superior. Of the two Bayesian procedures, neither was clearly superior,
but the simple (i.e., unprogressed) procedure was easier to compute
and therefore preferable.

Analysis of the two methods by test length and sample size sug- 
gested that there was a slight increase in efficiency of the equiva- 
lent-tests method as test length increased and a slight increase in 
efficiency of the Bayesian equivalent-groups procedure as sample size 
increased. These increases were small and inconsistent, however, and 
suggested that all of the test lengths and sample sizes investigated 
were nearly equivalent in terms of resulting efficiency for both the 
equivalent-tests and Bayesian methods. 



Anchor Group Method 

Procedure 

The anchor group linking method is, conceptually, very similar 
to the equivalent-groups method. The major conceptual distinction is 
that the anchor group method uses a single group of examinees for all 
linking and thus does not need to assume the statistical equivalence 
of several different groups. 

In this research, eight different anchor groups were evaluated.
The eight groups comprised four examinee sample sizes (10, 30, 50,
and 100) and two distribution forms (rectangular and normal). The
rectangular samples consisted of abilities evenly spaced between -1.7
and 1.7. The normal samples were created by selecting normal
deviates corresponding to evenly spaced percentiles from 2.0 to 98.0.
Values thus obtained for both normal and rectangular samples were
then standardized to ensure that the samples obtained had means of
exactly zero and variances of exactly one.
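
The construction of the eight anchor groups described above can be sketched as follows. This is a minimal illustration, not the program used in the study: the function names are hypothetical, and the use of the population variance (dividing by n) in the standardization step is an assumption, since the text states only that the variances were exactly one.

```python
from statistics import NormalDist, fmean, pstdev


def standardize(values):
    """Rescale values to mean 0 and population variance 1 (assumed convention)."""
    mu = fmean(values)
    sd = pstdev(values, mu)
    return [(v - mu) / sd for v in values]


def rectangular_group(n, low=-1.7, high=1.7):
    """n abilities evenly spaced between low and high, then standardized."""
    step = (high - low) / (n - 1)
    return standardize([low + i * step for i in range(n)])


def normal_group(n, low_pct=2.0, high_pct=98.0):
    """Normal deviates at n evenly spaced percentiles, then standardized."""
    step = (high_pct - low_pct) / (n - 1)
    z = NormalDist()  # standard normal distribution
    return standardize([z.inv_cdf((low_pct + i * step) / 100.0) for i in range(n)])


# Four sample sizes by two distribution forms gives the eight anchor groups.
anchor_groups = {
    (shape, n): make(n)
    for shape, make in (("rectangular", rectangular_group), ("normal", normal_group))
    for n in (10, 30, 50, 100)
}
```

Because both kinds of sample are symmetric about zero before standardization, the standardization mainly rescales the spread; the mean is already zero up to rounding.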

Linking by the anchor group method was done for all parameters
in the systematically sampled data set. This was accomplished by
administering all 60 tests in the data set to each of the examinees in
each of the anchor groups. Item parameters were then adjusted using
the same equations used for the equivalent-groups method, Equations
14 and 15. Two scoring procedures, the modal Bayesian procedure and
the robust-maximum-likelihood procedure, were used for this linking.

Results — Modal Bayesian Scores 

Fidelity of parameter estimation. Table 45 presents the item
parameter error statistics for the anchor group linking method for
each anchor group size and composition in the homogeneous linking
condition using modal Bayesian estimates. The first two columns
present the means and standard deviations of the true a and b parameters
averaged over cells in the systematically sampled data set. These

Table 45. Item Parameter Error— Anchor Groups
Homogeneous Condition Using Systematically Sampled Examinees 





                    True            Bias in       Absolute     RMS
Method           Mean     SD     Mean     SD       Error      Error       R

Normal 10
  a             1.588    .501   -.080    .033      .393       .540      .519
  b              .262   1.344    .180    .479      .440       .671      .977
Normal 30
  a             1.588    .501   -.076    .017      .380       .521      .527
  b              .262   1.344    .168    .443      .409       .614      .979
Normal 50
  a             1.588    .501   -.086    .019      .381       .525      .529
  b              .262   1.344    .186    .469      .424       .644      .979
Normal 100
  a             1.588    .501   -.101    .011      .374       .516      .530
  b              .262   1.344    .193    .480      .432       .659      .979
Uniform 10
  a             1.588    .501   -.110    .024      .395       .545      .516
  b              .262   1.344    .193    .516      .470       .717      .976
Uniform 30
  a             1.588    .501   -.135    .006      .386       .529      .520
  b              .262   1.344    .192    .530      .469       .706      .977
Uniform 50
  a             1.588    .501   -.137    .001      .370       .523   [illegible]
  b              .262   1.344    .203    .530      .470       .712      .979
Uniform 100
  a             1.588    .501   -.115    .003      .372       .516      .531
  b              .262   1.344    .208    .497      .448       .681      .980
No Linking
  a             1.588    .501    .139    .084      .450       .602      .533
  b              .262   1.344    .130    .237      .364       .464      .971


values are the same as those presented in Table 37 and will not be
discussed again here.

Biases in the estimated item parameters are presented in columns
three and four. With the exception of the no-linking group, all
groups tended to underestimate the a parameters. All groups tended
to overestimate the b parameters, with a trend for increasing bias
with increasing group size. The no-linking method revealed the least
b-parameter bias, while the normal group showed the least bias
overall. In terms of bias in parameter standard deviations, the uniform
group showed least bias in the a parameters and the normal group
showed least bias in the b parameters. Again, the no-linking method
showed the least bias in the b parameters overall.

Absolute and root-mean-square errors of the parameter estimates 
are presented in columns five and six. A slight trend toward decreas- 
ing absolute error in the a parameters with increasing anchor group 
size was apparent for both distributions, although it was more pro- 
nounced with the uniform anchor groups. No consistent differences 
were apparent between the group compositions with respect to a-param- 
eter absolute error, but both produced less error than the no-linking 
procedure. Absolute error of the b parameters suggested different 
conclusions: There were no noticeable decreases with increasing anchor 
group sizes for the normal group and there were slight decreases for 
the uniform group. The no-linking procedure produced the least error, 
and the uniform groups consistently produced the most error. The same
conclusions drawn from the absolute errors could also be drawn from
the root-mean-square errors.

The correlations between true and estimated a and b parameters 
are shown in the last column of Table 45. There was a slight in- 
creasing trend in both the a- and b-parameter correlations with in- 
creasing anchor group size for both shapes of ability distribution. 
The no-linking procedure produced a-parameter correlations slightly
higher than those of other methods and b-parameter correlations that
were slightly lower.
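
The fidelity statistics reported in these tables (bias in the mean and standard deviation, absolute error, root-mean-square error, and the correlation R) can be illustrated with a short sketch. The definitions below are assumptions chosen to agree with how the text reads the tables: bias is taken as estimated minus true, so that a negative value indicates underestimation, and standard deviations are computed with the population formula. The function name is hypothetical.

```python
from math import sqrt


def fidelity_statistics(true_values, estimates):
    """Summarize how well estimated parameters recover the true ones."""
    n = len(true_values)
    mean_t = sum(true_values) / n
    mean_e = sum(estimates) / n
    sd_t = sqrt(sum((t - mean_t) ** 2 for t in true_values) / n)
    sd_e = sqrt(sum((e - mean_e) ** 2 for e in estimates) / n)
    errors = [e - t for t, e in zip(true_values, estimates)]
    cov = sum((t - mean_t) * (e - mean_e)
              for t, e in zip(true_values, estimates)) / n
    return {
        "bias_mean": mean_e - mean_t,   # assumed sign: estimated minus true
        "bias_sd": sd_e - sd_t,         # assumed sign: estimated minus true
        "abs_error": sum(abs(d) for d in errors) / n,
        "rms_error": sqrt(sum(d * d for d in errors) / n),
        "r": cov / (sd_t * sd_e),
    }


# Hypothetical b parameters, each overestimated by a constant 0.5.
stats = fidelity_statistics([-1.0, 0.0, 1.0, 2.0], [-0.5, 0.5, 1.5, 2.5])
```

A constant shift of this kind leaves the correlation at 1.0 while producing nonzero bias and error, which is one reason the correlations in these tables can stay high even when the bias and error columns differ between methods.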

The fidelity-of-calibration data for the heterogeneous condition 
are presented in Table 46. Since observations about the true item 
parameters remain the same across linking methods, they will not be 
repeated here. 

The parameter biases presented in columns three and four were
essentially the same as those of the homogeneous case. The bias of
the a-parameter means tended to be somewhat smaller for the
homogeneous case while the same trend was observed with respect to bias in
the a-parameter standard deviations. For the b parameters, however,
the bias in both the mean and standard deviation was greater in the
heterogeneous condition.

Parameter errors depicted in columns five and six were essen- 
tially the same as those for the homogeneous case for the a param- 
eters. The differences between the heterogeneous and homogeneous 
conditions appeared in the third decimal place for the a parameters. 
The b-parameter errors for the heterogeneous condition showed a

Table 46. Item Parameter Error— Anchor Groups
Heterogeneous Condition Using Systematically Sampled Examinees

                    True            Bias in       Absolute     RMS
Method           Mean     SD     Mean     SD       Error      Error       R

Normal 10
  a             1.586    .500   -.082    .045      .391       .538      .497
  b              .281   1.371    .203    .501      .450       .680      .972
Normal 30
  a             1.586    .500   -.077    .027      .384       .522      .507
  b              .281   1.371    .189    .168      .119       .622      .971
Normal 50
  a             1.586    .500   -.087    .029      .385       .526      .501
  b              .281   1.371    .207    .192      .135       .653      .971
Normal 100
  a             1.586    .500   -.102    .017      .377       .517      .515
  b              .281   1.371    .211    .504      .113       .667      .973
Uniform 10
  a             1.586    .500   -.111    .010      .100       .517      .477
  b              .281   1.371    .219    .550      .133       .?30      .968
Uniform 30
  a             1.586    .500   -.137    .011      .389       .530      .198
  b              .281   1.371    .215    .557      .182       .718      .972
Uniform 50
  a             1.586    .500   -.138    .006      .381       .521      .505
  b              .281   1.371    .221    .557      .182       .721      .972
Uniform 100
  a             1.586    .500   -.117    .008      .371       .516      .513
  b              .281   1.371    .229    .525      .159       .690      .973
No Linking
  a             1.586    .500    .138    .127      .155       .601      .481
  b              .281   1.371    .116    .216      .368       .166      .971



slight increase over the homogeneous condition. Absolute errors of
the b parameters showed no noticeable trends with increasing anchor
group size for the normal groups but showed a slight decreasing trend
with increasing uniform anchor group size. Many of the same
conclusions could also be drawn from the root-mean-square errors.

Whereas bias and error statistics were quite similar for the
homogeneous and heterogeneous conditions, the correlations between
true and estimated parameters showed a noticeable drop from their
corresponding values in the homogeneous condition. Differences in the
second decimal place were observed for the a parameters and in the
third decimal place for the b parameters. There was a slight tendency
for the correlations to increase with increasing anchor group size.
The no-linking procedure's correlation for the a parameters was,
however, somewhat lower than most correlations produced by the anchor
group procedures.

Characteristics of asymptotic ability estimates. Table 47
presents descriptive statistics for the asymptotic ability estimates in
the homogeneous case. Mean asymptotic ability estimates were close
to zero for all cases while the corresponding standard deviations
were close to one. For the most part, means were overestimated, as
were the standard deviations.

Table 47. Asymptotic Ability Estimates— Anchor Groups 
Homogeneous Condition Using Systematically Sampled Examinees 



                                  Absolute     RMS
Method           Mean      SD      Error      Error       R

Normal 10        .005    1.070     .085       .131      .996
Normal 30       -.009    1.066     .081       .129      .996
Normal 50        .004    1.070     .081       .129      .996
Normal 100       .004    1.078     .081       .131      .996
Uniform 10       .003    1.092     .105       .156      .996
Uniform 30      -.005    1.104     .101       .151      .996
Uniform 50       .005    1.108     .098       .157      .996
Uniform 100      .017    1.091     .085       .142      .996
No Linking       .003     .970     .125       .162      .996




Absolute error presented in column three was lowest for the
normal anchor group and greatest for the no-linking procedure. Absolute
error appeared to decrease with increasing anchor group size for the
uniform anchor group. No trend was obvious for the normal group.

Root-mean-square error, presented in column four, showed the same
differences among linking methods. Trends within methods as a function
of anchor group size were not apparent.

Correlations between the true and asymptotic ability, shown in
column five, were uniformly .996.

Statistics for the asymptotic ability in the heterogeneous case 
are presented in Table 48. Slight changes were observed from the
homogeneous condition for the means and standard deviations. Whereas
the homogeneous condition tended to overestimate the means, the heter- 
ogeneous condition tended to underestimate them. Standard deviations 
of the asymptotic estimates for the heterogeneous condition were 
smaller than for the homogeneous condition. 



Table 48. Asymptotic Ability Estimates — Anchor Groups 
Heterogeneous Condition Using Systematically Sampled Examinees 



                                  Absolute     RMS
Method           Mean      SD      Error      Error       R

Normal 10        .004    1.065     .085       .125      .996
Normal 30       -.012    1.061     .075       .117      .996
Normal 50       -.001    1.066     .072       .117      .996
Normal 100       .000    1.075     .078       .125      .996
Uniform 10      -.000    1.082     .085       .130      .996
Uniform 30      -.00?    1.100     .096       .139      .996
Uniform 50      -.00?    1.103     .095       .140      .996
Uniform 100      .014    1.088     .081       .131      .996
No Linking      -.013     .962     .095       .127      .995



Absolute and root-mean-square errors of the asymptotic estimates
were uniformly lower in the heterogeneous condition than in the
homogeneous condition. Trends with respect to anchor group size were not
apparent, however, and the no-linking method was not consistently
inferior.

Correlations between true and asymptotic ability were identical 
to the homogeneous condition (i.e., .996) for the anchor group pro- 
cedures. The no-linking procedure produced a correlation slightly 
lower in the heterogeneous condition. 

Efficiency of ability estimation. Table 49 presents the
efficiencies achieved by the homogeneous linking condition with systematically
sampled examinees. The average item information, presented in the
first column, was nearly identical for both the normal and uniform
groups and increased as sample size increased. The no-linking group
showed the lowest average item information.



Table 49. Efficiency Analysis— Anchor Groups 
Homogeneous Condition Using Systematically Sampled Examinees 



                      Average          Efficiency Relative to
                       Item            True         Estimated
Method              Information     Parameters     Parameters

True Parameters        .314
Est. Parameters        .278            .887
Normal 10              .272            .869           .979
Normal 30              .274            .875           .986
Normal 50              .274            .875           .986
Normal 100             .275            .876           .987
Uniform 10             .272            .866           .976
Uniform 30             .274            .873           .983
Uniform 50             .275            .876           .987
Uniform 100            .275            .877           .988
No Linking             .266            .849           .957




Linking efficiency, shown in the third column, showed a slight
rise as sample size went from 10 to 30 but negligible change from 30
to 100. There were no consistent differences between the two anchor
group distributions. The no-linking case showed the lowest
efficiency, .957.

Relative efficiencies for the heterogeneous condition are
presented in Table 50. The same trends were apparent here (except for
rounding error) as were shown for the homogeneous case. Information
values and relative efficiencies were markedly lower for the
heterogeneous condition than for the homogeneous condition. As before, a
sharp rise was noted as sample size increased from 10 to 30, but
there were negligible increases thereafter.



Table 50. Efficiency Analysis — Anchor Groups 
Heterogeneous Condition Using Systematically Sampled Examinees 





                      Average          Efficiency Relative to
                       Item            True         Estimated
Method              Information     Parameters     Parameters

True Parameters        .305
Est. Parameters        .271            .889
Normal 10              .259            .850           .956
Normal 30              .261            .857           .964
Normal 50              .260            .855           .962
Normal 100             .261            .858           .966
Uniform 10             .25?            .845           .951
Uniform 30             .261            .856           .963
Uniform 50             .261            .858           .966
Uniform 100            .26?            .860           .968
No Linking             .248            .814           .916




Results — Robust-Maximum-Likelihood Scores

Fidelity of parameter estimation. Table 51 is a condensed table
of the modal Bayesian and robust-maximum-likelihood item parameter
error statistics for the anchor group linking design in the homogeneous

Table 51. Item Parameter Error — Anchor Groups
Homogeneous Condition Using Systematically Sampled Examinees 



                         Bayesian                      Maximum Likelihood
                  Bias in         RMS               Bias in         RMS
Method          Mean     SD      Error     R      Mean     SD      Error     R

Normal 10
  a            -.054    .060     .562    .489     .699    .391    1.168    .438
  b             .151    .422     .590    .978    -.035   -.118     .331    .973
Normal 30
  a            -.076    .037     .552    .488     .454    .256     .834    .444
  b             .164    .429     .597    .981    -.004    .007     .426    .968
Normal 50
  a            -.052    .051     .562    .486     .441    .244     .857    .467
  b             .166    .419     .597    .979    -.016    .002     .320    .975
Normal 100
  a            -.107    .025     .541    .487     .483    .263     .896    .462
  b             .203    .463     .653    .980    -.023   -.027     .307    .976
Uniform 10
  a            -.060    .066     .601    .463    -.007    .182     .706    .381
  b             .185    .447     .637    .975     .160    .531     .905    .952
Uniform 30
  a            -.127    .023     .549    .483     .120    .165     .640    .478
  b             .182    .500     .671    .979     .071    .300     .581    .971
Uniform 50
  a            -.117    .030     .555    .485     .175    .174     .717    .457
  b             .207    .499     .684    .979     .079    .222     .426    .974
Uniform 100
  a            -.105    .028     .546    .487     .169    .160     .670    .453
  b             .207    .478     .673    .980     .07?    .232     .497    .973
No Linking
  a             .143    .112     .629    .501     .143    .112     .629    .501
  b             .147    .228     .444    .973     .147    .228     .444    .973



case. The table values represent averages taken over four cells of
the data matrix (i.e., 1000 examinees and 20, 35, 50, and 55 items),
rather than over the entire 3x4 matrix, as in the previous section.

Whereas the bias in the a-parameter means, using modal Bayesian
estimation, tended to be slightly negative for both the normal and
uniform groups (indicating that the a parameters were underestimated),
the robust-maximum-likelihood procedure grossly overestimated the
means for the normal group and slightly overestimated the means for
the uniform group. The trends with respect to the b-parameter biases
were reversed from those noted for the a parameters. The
robust-maximum-likelihood procedure produced a b-parameter mean that was much
closer to the true value of 0.0 than did the modal Bayesian estimate.
The normal group tended to produce slight underestimates of the
b-parameter mean while the uniform group produced slight overestimates.
Both groups produced overestimates of the b mean when the modal
Bayesian scoring procedure was used.

The same general trends noted for the bias in parameter means 
held also for the biases in the parameter standard deviations. The 
robust-maximum-likelihood estimates tended to overestimate the a- 
parameter standard deviations more than their counterparts in the 
Bayesian case. As was the case for the b-parameter means, the ro- 
bust-maximum-likelihood estimates of the standard deviations were 
much closer to the true value of 1.0 than were the modal Bayesian 
estimates. The normal groups revealed a much smaller bias in b- 
parameter standard deviations than did the uniform groups using 
robust maximum likelihood. The Bayesian modal estimates showed very 
little difference between the normal and uniform groups. 

In terms of root-mean-square error in the a parameter, modal 
Bayesian procedures showed the least error, regardless of distribu- 
tion shape. On the other hand, robust-maximum-likelihood procedures 
provided the smallest errors for the b parameters. The normal group 
produced less error than the uniform group, with a slight tendency 
for increasing error with increasing anchor group size. 

The correlations between true and estimated parameters were con- 
sistently higher with modal Bayesian procedures than with robust-max- 
imum-likelihood procedures although in several instances the differ- 
ences were in the third decimal place. There were no consistent 
differences among group compositions or sizes. As usual, correla- 
tions for the b parameters were considerably higher than for the a 
parameters. 

Characteristics of asymptotic ability estimates. Table 52
presents summary statistics for the asymptotic ability estimates using
both modal Bayesian and robust-maximum-likelihood procedures. The
robust-maximum-likelihood procedure resulted in slight underestimation
of the means for both the normal and uniform groups. Standard
deviations were also underestimated, compared to the modal Bayesian groups
which tended to overestimate the standard deviation. For the
robust-maximum-likelihood procedures, there was a noticeable difference
between the normal group, which produced underestimated standard

Table 52. Asymptotic Ability Estimates— Anchor Groups
Homogeneous Condition Using Systematically Sampled Examinees 



                              Bayesian                                  Maximum Likelihood
                                    RMS                                            RMS
Method          Mean      SD       Error         R          Mean          SD       Error         R

Normal 10      -.006    1.045      .114     [illegible]  [illegible]     .75?   [illegible]    .990
Normal 30      -.009    1.066      .125     [illegible]  [illegible] [illegible] [illegible] [illegible]
Normal 50      -.004    1.044      .108        .996        -.018         .791      .236        .996
Normal 100      .010    1.080      .131        .996        -.044         .779      .217        .996
Uniform 10      .013    1.061      .126        .997        -.048         .993      .136        .997
Uniform 30     -.009    1.093      .144        .996        -.049         .932      .133        .997
Uniform 50      .012    1.090      .131        .996        -.015         .920      .143        .996
Uniform 100     .016    1.078      .128        .996        -.033         .911      .135        .996
No Linking      .034     .952      .133        .996         .031         .962      .133        .996




deviations, and the uniform group, which produced overestimated
standard deviations.

In terms of root-mean-square error, there were again notable
differences between the normal and uniform groups using
robust-maximum-likelihood procedures. The normal group had bias values
considerably greater than its counterpart using modal Bayesian procedures
while the uniform group had error values quite comparable to their
Bayesian counterparts. The normal-group errors, using
robust-maximum-likelihood scoring, were by far the largest of any of the methods.

Correlations between true and estimated parameters using
robust-maximum-likelihood procedures were uniformly high (.996) and
virtually identical to their Bayesian counterparts.

Efficiency of ability estimation. Table 53 presents comparisons
of robust-maximum-likelihood with modal Bayesian procedures in terms
of relative efficiencies achieved by each method. The average amount
of information available per item tended to be higher for the modal
Bayesian procedures than for the robust-maximum-likelihood procedures.
This, of course, meant that the efficiencies relative to the true and

Table 53. Efficiency Analysis — Anchor Groups
Homogeneous Condition Using Systematically Sampled Examinees 



                          Bayesian                           Maximum Likelihood
                              Efficiencies                           Efficiencies
                               Relative to                            Relative to
                Avg. Item     True      Est.        Avg. Item        True          Est.
Method            Info.      Params.   Params.        Info.         Params.       Params.

True Params.      .306                                .306
Est. Params.      .270        .882                    .270           .882
Normal 10         .265        .866      .983          .257           .840          .953
Normal 30         .267        .874      .991          .262           .857          .972
Normal 50         .266        .870      .987          .265       [illegible]   [illegible]
Normal 100        .267        .873      .991          .264           .862          .978
Uniform 10        .263        .860      .976          .252           .824          .935
Uniform 30        .267        .872      .989          .262           .856          .971
Uniform 50        .267        .872      .990          .262           .858          .973
Uniform 100       .267        .873      .990          .264           .865          .981
No Linking        .260        .850      .964          .260           .850          .964




estimated parameters were also higher for modal Bayesian than for
robust-maximum-likelihood procedures. The magnitude of the differences was,
with one exception, in the second decimal place.

The normal group showed no consistent trend with increasing group 
size. The uniform group showed a tendency for increasing efficiency 
with increasing group size. These trends appeared for both modal 
Bayesian and robust-maximum-likelihood procedures. 

Discussion 

Most of the analyses thus far have presented rather conflicting
results. Different analyses have suggested different procedures that
were "best." Using fidelity of parameter estimation as a criterion,
modal Bayesian procedures tended to produce more accurate estimates
of the a parameter while the robust-maximum-likelihood procedures
tended to produce more accurate estimates of the b parameter. Within
the modal Bayesian procedures, there did not appear to be any
clear-cut advantage to either group composition. For the
robust-maximum-likelihood procedures, there was a clear trend for the normal groups
to produce consistently better estimates for the b parameters than
those estimates produced from the uniform groups.

Using asymptotic ability estimates as the evaluative criterion, 
modal Bayesian procedures with normally distributed anchor group abil- 
ities appeared to be consistently best. Modal Bayesian procedures 
with uniformly distributed abilities were second best. Robust-maximum- 
likelihood scoring using uniform and normal anchor groups followed in 
that order. 

Modal Bayesian procedures showed efficiencies consistently
higher than robust-maximum-likelihood procedures regardless of anchor
group composition or size. With the modal Bayesian procedures, the
normal groups tended to yield slightly more efficiency than did the 
uniform groups. Both groups were superior to the no-linking condition. 



Anchor Test Method 



Procedure 

Generation of the source item pool. The first step in the
application of the anchor test method was to construct a source item
pool from which the anchor tests could be selected. To obtain the
source item pool, 200 a, b, and c parameters were independently
generated as discussed previously. The first four central moments of each
of these distributions matched those specified earlier as being
representative of a "typical" ASVAB item pool. These parameters
represented the "true" parameters of 200 hypothetical items.

Dichotomous item responses for these 200 items were simulated 
for 4000 examinees randomly selected from a distribution of abilities 
with distributional moments representative of the total AFEES popula- 
tion. All examinees responded according to the three-parameter logis- 
tic IRT model. Item parameter estimates were obtained for these 200 
items using program OGIVIA. The items were, due to computer program
limitations, calibrated in two sets of 100 items each. 
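
The simulation of dichotomous responses under the three-parameter logistic model can be sketched as below. This is only an illustration under stated assumptions: the scaling constant D = 1.7 is a common convention that this section does not confirm, the function names are hypothetical, and the parameter and ability distributions are placeholders rather than the ASVAB and AFEES moments used in the study.

```python
import math
import random

D = 1.7  # logistic scaling constant; an assumption, not stated in this section


def p_correct(theta, a, b, c):
    """Three-parameter logistic probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def simulate_responses(thetas, items, rng):
    """Return one vector of 0/1 item responses per examinee."""
    return [[1 if rng.random() < p_correct(theta, a, b, c) else 0
             for (a, b, c) in items]
            for theta in thetas]


rng = random.Random(12345)
# Placeholder distributions only; they do NOT reproduce the report's moments.
items = [(rng.uniform(0.8, 2.4), rng.gauss(0.0, 1.3), rng.uniform(0.1, 0.3))
         for _ in range(200)]
thetas = [rng.gauss(0.0, 1.0) for _ in range(4000)]
responses = simulate_responses(thetas, items, rng)
```

The resulting 4000 by 200 response matrix is the kind of input a calibration program would take; the calibration step itself is not sketched here.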

Selection of anchor-test items . Three different 25-item anchor 
tests were constructed by selecting items from the original set of 200 
items. These anchor tests were constructed so that their test
information curves were approximately normal, rectangular, and peaked.

The peaked test was constructed by selecting the 25 items which
provided the most information at theta equal to zero, according to
their estimated item parameters; this is the way items would typically
be selected for inclusion in a peaked test. In order to get an
indication of the amount of information actually contained in this test,
the true information was computed, using the true item parameters, for
61 theta values at intervals of .10 from -3.00 to 3.00. These
information values were then averaged across the 61 theta values; this average
was 8.320.
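
The item selection and averaging just described can be illustrated with the standard three-parameter logistic item information function. The sketch below is an assumption-laden illustration, not the study's program: D = 1.7 and all function names are hypothetical, and the information formula is the usual textbook form rather than one quoted from this report.

```python
import math

D = 1.7  # logistic scaling constant; an assumption


def p_correct(theta, a, b, c):
    """Three-parameter logistic probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def item_information(theta, a, b, c):
    """Standard 3PL item information at ability theta."""
    p = p_correct(theta, a, b, c)
    return (D * a) ** 2 * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2


def select_peaked(items, k=25, theta=0.0):
    """Pick the k items with the most information at the given theta."""
    return sorted(items, key=lambda it: item_information(theta, *it),
                  reverse=True)[:k]


# 61 theta values at intervals of .10 from -3.00 to 3.00.
THETA_GRID = [round(-3.0 + 0.1 * i, 1) for i in range(61)]


def average_test_information(items, grid=THETA_GRID):
    """Test information (sum over items) averaged across the theta grid."""
    return sum(sum(item_information(t, *it) for it in items)
               for t in grid) / len(grid)
```

In the procedure described above, selection would use the estimated item parameters, while the reported average would be computed by passing the true parameters of the selected items to `average_test_information`.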

Items for the rectangular and normal tests were selected so that
their test information curves were shaped approximately rectangular
and normal, respectively, and so that the true test information,
computed using the true item parameters and averaged as before over 61
theta values from -3.00 to 3.00, approached the value obtained by the
peaked test. These averages were 8.410 and 8.232 for the rectangular
and normal tests, respectively. When the test information was
computed on the basis of the estimated item parameters, these averages were
8.485, 9.294, and 9.121 for the peaked, rectangular, and normal tests,
respectively. Figure 9 presents the true information curves, based on
the true item parameters, for the three 25-item anchor tests.



Figure 9. True Information Curves, Using True Item Parameters, 
for Each of Three Anchor Tests 







Two additional embedded tests for each of these three anchor
tests were obtained by selecting the first five items and the first
15 items from each. Thus, the nine anchor tests considered here
comprised three groups of 5-, 15-, and 25-item tests, whose test
information curves were approximately normal, rectangular, and
peaked, respectively. The items included in these
anchor tests are presented in Appendix Table A-2.

Determination of the linking transformations. The nine anchor
tests were "administered" to the 70,000 examinees comprising the
systematically sampled basic data set. This simulation was
accomplished by generating response vectors using the true theta levels
of these examinees and then scoring the anchor tests. Once item
responses were available for the items in each anchor test, a modal
Bayesian estimate of ability was computed for each examinee on each
anchor test, using a standard normal prior distribution of abilities
and scoring each response vector using the estimated item parameters.
For each of the 60 calibration groups, the mean and standard
deviation of estimated ability were computed on each of the nine anchor
tests. These values were then used for the transformation constants
for anchor-test linking.
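
A modal Bayesian ability estimate with a standard normal prior can be approximated by a simple grid search over theta, as in the sketch below. This is not the scoring routine used in the study: the grid bounds and step size, the constant D = 1.7, the probability clamp, and the function names are all assumptions made for illustration.

```python
import math
from statistics import fmean, pstdev

D = 1.7  # logistic scaling constant; an assumption


def p_correct(theta, a, b, c):
    """Three-parameter logistic probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def modal_bayes_theta(responses, items, bound=4.0, step=0.01):
    """Grid-search the posterior mode of theta under a N(0, 1) prior."""
    n_steps = int(round(2.0 * bound / step))
    best_theta, best_logpost = 0.0, -math.inf
    for i in range(n_steps + 1):
        theta = -bound + i * step
        logpost = -0.5 * theta * theta  # log prior, up to an additive constant
        for u, (a, b, c) in zip(responses, items):
            # Clamp to keep the logarithms finite at extreme theta values.
            p = min(max(p_correct(theta, a, b, c), 1e-12), 1.0 - 1e-12)
            logpost += math.log(p) if u else math.log(1.0 - p)
        if logpost > best_logpost:
            best_theta, best_logpost = theta, logpost
    return best_theta


def group_summary(theta_hats):
    """Mean and (population) SD of ability estimates for one calibration group."""
    return fmean(theta_hats), pstdev(theta_hats)
```

For each calibration group, `group_summary` would be applied once to the estimates from an anchor test and once to the estimates from the non-anchor test, giving the four statistics that enter the linking constants.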


Linking under the anchor-test method is accomplished by
transforming the non-anchor-test item parameters such that the mean and
standard deviation of ability of the groups under consideration, as
estimated from the non-anchor test, match the mean and standard
deviation of ability estimated from the anchor test alone. When the
transformation constants k and m are applied in the form presented by
Equations 14 and 15, the constants k and m may be expressed as:

     k = \sigma_a / \sigma_n                                    [30]

and  m = \mu_a - k \mu_n                                        [31]

where \mu_n and \sigma_n are, respectively, the mean and standard
deviation of ability estimates in the non-anchor test and \mu_a and
\sigma_a are the corresponding statistics for the anchor test.
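
A small numerical sketch of Equations 30 and 31 follows. The constants k and m are computed exactly as defined above. The way they are then applied to item parameters (a divided by k, b rescaled as kb + m, c unchanged) is the common linear-linking convention and is an assumption here, since Equations 14 and 15 are not reproduced in this section; the function names and input values are hypothetical.

```python
def linking_constants(mean_anchor, sd_anchor, mean_nonanchor, sd_nonanchor):
    """Equations 30 and 31: slope k and intercept m of the linking line."""
    k = sd_anchor / sd_nonanchor
    m = mean_anchor - k * mean_nonanchor
    return k, m


def transform_item(a, b, c, k, m):
    """Assumed application of k and m to 3PL item parameters."""
    return a / k, k * b + m, c


# Hypothetical group statistics for one calibration group.
k, m = linking_constants(mean_anchor=0.10, sd_anchor=0.90,
                         mean_nonanchor=-0.20, sd_nonanchor=1.20)
a_new, b_new, c_new = transform_item(1.5, 0.4, 0.2, k, m)
```

With these inputs, k = 0.75 and m = 0.25, so an ability of -0.20 on the non-anchor scale maps to 0.10 on the anchor scale, which is the anchor-test mean, as the matching requirement demands.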

Results — Modal Bayesian Scores

Fidelity of parameter estimation. Fidelity-of-estimation
statistics for the homogeneous condition, using the Bayesian scoring
technique, are presented in Table 54. The true means and standard
deviations of the a and b parameters are presented in the first two
columns of this table. Columns three and four present the bias in
the means and standard deviations of the item parameters. The largest

Table 54. Item Parameter Error— Anchor Tests
Homogeneous Condition Using Systematically Sampled Examinees 



                      True            Bias in       Absolute     RMS
Method             Mean     SD     Mean     SD       Error      Error       R

Normal 5
  a               1.588    .501    .574    .237      .718       .874      .532
  b                .262   1.344    .135   -.091      .258       .350      .979
Normal 15
  a               1.588    .501    .095    .076      .411       .552      .531
  b                .262   1.344    .226    .266      .320       .509      .980
Normal 25
  a               1.588    .501    .067    .067      .105       .511      .530
  b                .262   1.344    .232    .293      .333       .529      .980
Rectangular 5
  a               1.588    .501    .400    .182      .589       .738      .530
  b                .262   1.344    .168    .020      .253       .365      .980
Rectangular 15
  a               1.588    .501    .095    .077      .116       .551      .532
  b                .262   1.344    .227    .267      .321       .506      .980
Rectangular 25
  a               1.588    .501    .042    .058      .396       .536      .531
  b                .262   1.344    .233    .318      .311       .511      .980
Peaked 5
  a               1.588    .501   1.092    .418     1.169      1.359      .531
  b                .262   1.344    .029   -.332      .312       .130      .930
Peaked 15
  a               1.588    .501    .617    .255      .751       .9??      .531
  b                .262   1.344    .102   -.115      .255       .311      .930
Peaked 25
  a               1.588    .501    .457    .201      .629       .780      .529
  b                .262   1.344    .145   -.017      .218       .359      .979
No Linking
  a               1.588    .501    .139    .084      .450       .602      .533
  b                .262   1.344    .130    .237      .364       .464      .971

biases in the mean of the a parameters were observed for the peaked
tests, and ranged from .457 for the 25-item anchor test to 1.092 for
the 5-item anchor test. The smallest biases in the means were
observed for the rectangular tests, although the biases for the
normal tests were only slightly higher at the longer test lengths. The
smallest biases were observed for the 25-item normal and rectangular
tests, with values of .067 and .042, respectively. When no linking
was performed, the bias in the mean of the a parameters was .139;
this value was exceeded by all three peaked tests, but only by the
5-item normal and rectangular tests.

Biases in the standard deviations of the a parameters were largest
for the peaked tests, ranging from .201 to .418. Again, there was
little difference observed between the biases in the standard
deviations of the a parameters for the normal and the rectangular
tests, although they were slightly smaller for the rectangular tests.
The smallest biases were observed for the 25-item normal and
rectangular tests. As before, biases for all three peaked tests
exceeded the value of .084 observed in the no-linking condition,
whereas only the 5-item normal and rectangular tests exceeded this
value. Biases in both the means and the standard deviations of the a
parameters decreased with increased test length.

The smallest biases in the mean of the b parameters were observed
for the peaked tests; these values ranged from .029 to .145. There
were essentially no differences between the rectangular and normal
tests in terms of bias in the mean b's; these values clustered
between .135 and .233. These bias figures increased with increased
test lengths for all three anchor test types. In the no-linking
condition, bias in the mean b's was .130, which was exceeded by all
tests except the 5- and 15-item peaked tests.

The standard deviations of the b parameters were underestimated 
for the peaked tests, since all these bias values were negative, rang- 
ing from -.017 to -.332. The differences between the normal and rec- 
tangular tests were not consistent, though the normal test was some- 
what better at test lengths greater than five items. The bias in the 
b-parameter standard deviation was .237 in the no-linking condition, 
and this value was exceeded by all the tests except the shortest normal 
and rectangular tests and the two longest peaked tests. 

Mean absolute and root-mean-square errors in the parameters are
presented in columns five and six of Table 54. The peaked anchor
tests performed most poorly according to both of these indices of
error for the a parameters. The mean absolute error in estimating a
was .629 for the 25-item peaked test, and was as high as 1.169 for
the 5-item peaked test. The rectangular tests were best overall, but
for 15 and 25 items, the normal tests performed nearly as well. The
least error was observed for the 25-item rectangular and normal
tests. When no linking was performed at all, mean absolute error was
.450. All three peaked tests exceeded this value, but only the 5-item
version of the normal and rectangular tests did.

The pattern was identical for the root-mean-square error in 
the a parameters. That is, the peaked tests performed most poorly, 
and all three peaked tests exceeded the root-mean-square error of 
.602 which was observed in the no-linking condition. Again, the 
rectangular tests were best overall, but for 15 and 25 items, the 
normal tests performed nearly as well. The least error was observed 
for the 25-item rectangular and normal tests. For all three kinds
of anchor tests, both absolute and root-mean-square errors in the a 
parameters decreased with increasing anchor test size. 

The pattern of errors was somewhat different for the b param- 
eters. Overall, there were essentially no differences among the an- 
chor test types in mean absolute error; these values ranged from .248 
to .344 across the nine tests, and all these values were below the 
.364 observed in the no-linking condition. For the peaked tests,
mean absolute errors decreased with anchor test size as expected.
For the rectangular and normal tests, however, these errors increased
with test size, as was observed for the bias statistics. 

The peaked tests were better, in general, than the other two
kinds of tests in terms of root-mean-square errors in the b
parameters. These values ranged from .344 to .430 and, although there
was no trend observed with respect to anchor test size, all these
values were below the .464 observed in the no-linking condition. The
normal tests were slightly superior to the rectangular tests in terms
of root-mean-square error. In both cases, errors increased with
increasing anchor test length.

There were small differences observed across anchor tests in 
terms of the correlations between the true and estimated item param- 
eters. For the a parameters, these values clustered between .529 
and .532 for all nine anchor tests; all these correlations were lower 
than the .533 observed in the no-linking condition. There were no 
systematic trends observed with anchor test size. 

For the b parameters, these correlations were approximately .980
for all nine tests, and therefore, all of them were higher than the
.971 observed in the no-linking condition.
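The five fidelity-of-estimation indices discussed above (bias in the mean, bias in the standard deviation, mean absolute error, root-mean-square error, and the correlation between true and estimated values) can be sketched as follows. This is an illustration under stated assumptions, not the analysis program used in this research: bias is taken as estimated minus true (consistent with negative bias denoting underestimation), standard deviations use the population formula, and the function name and data are hypothetical.

```python
import math


def fidelity_stats(true_vals, est_vals):
    """Summarize how well estimated parameters recover the true ones."""
    n = len(true_vals)
    mean_t = sum(true_vals) / n
    mean_e = sum(est_vals) / n
    sd_t = math.sqrt(sum((t - mean_t) ** 2 for t in true_vals) / n)
    sd_e = math.sqrt(sum((e - mean_e) ** 2 for e in est_vals) / n)
    errors = [e - t for t, e in zip(true_vals, est_vals)]
    cov = sum((t - mean_t) * (e - mean_e)
              for t, e in zip(true_vals, est_vals)) / n
    return {
        "bias_mean": mean_e - mean_t,   # estimated mean minus true mean
        "bias_sd": sd_e - sd_t,         # estimated SD minus true SD
        "abs_error": sum(abs(d) for d in errors) / n,
        "rms_error": math.sqrt(sum(d * d for d in errors) / n),
        "r": cov / (sd_t * sd_e),       # Pearson correlation
    }


# Made-up true and estimated discrimination (a) parameters.
true_a = [1.0, 1.5, 2.0, 2.5]
est_a = [1.2, 1.4, 2.4, 2.6]
stats = fidelity_stats(true_a, est_a)
```

Because the root-mean-square error weights large deviations more heavily, it is never smaller than the mean absolute error, which is the ordering seen in every row of the tables in this section.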

Fidelity-of-estimation statistics for the heterogeneous condition
are presented in Table 55. As was observed for the homogeneous
condition, bias in the mean a parameters was largest for the peaked
tests and smallest for the rectangular tests; bias for the normal
tests was only slightly larger than that for the rectangular tests.
In the no-linking condition, bias in the mean a parameter was .138,
which was exceeded by all the peaked tests and by the 5-item normal
and rectangular tests. These bias figures decreased with increased
test length for all three anchor test types.

Table 55. Item Parameter Error: Anchor Tests
Heterogeneous Condition Using Systematically Sampled Examinees

                        True          Bias in   Absolute     RMS
Method              Mean      SD    Mean      SD   Error   Error       R

Normal 5
  a                1.586    .500    .571    .246    .714    .871    .513
  b                 .281   1.374    .143   -.084    .261    .317    .975
Normal 15
  a                1.586    .500    .093    .082    .417    .552    .515
  b                 .281   1.374    .212    .285    .328    .511    .974
Normal 25
  a                1.586    .500    .066    .075    .410    .544    .513
  b                 .281   1.374    .248    .313    .341    .535    .974
Rectangular 5
  a                1.586    .500    .397    .193    .590    .738    .51?
  b                 .281   1.374    .178    .029    .257    .363    .975
Rectangular 15
  a                1.586    .500    .093    .085    .419    .554    .515
  b                 .281   1.374    .242    .284    .328    .511    .975
Rectangular 25
  a                1.586    .500    .041    .065    .401    .536    .514
  b                 .281   1.374    .250    .338    .352    .550    .975
Peaked 5
  a                1.586    .500   1.088    .431   1.161   1.355    .512
  b                 .281   1.374    .032   -.332    .347    .431    .974
Peaked 15
  a                1.586    .500    .615    .266    .750    .913    .513
  b                 .281   1.374    .110   -.107    .258    .341    .974
Peaked 25
  a                1.586    .500    .455    .212    .628    .780    .511
  b                 .281   1.374    .155   -.005    .251    .353    .973
No Linking
  a                1.586    .500    .138    .127    .455    .604    .484
  b                 .281   1.374    .146    .246    .368    .466    .971

Bias in the standard deviation of the a parameters was greatest 
for the peaked tests, ranging from .212 to .431. There were only
small differences between the normal and rectangular tests, with the 
slight advantage going to the rectangular test at the longer test 
lengths. The smallest biases were observed for the 25-item normal 
and rectangular tests. The bias in the no-linking condition, .127, 
was exceeded by all the peaked tests and the 5-item normal and rec- 
tangular tests. As before, all these bias figures decreased with in- 
creased test lengths. 

In terms of the bias in the mean b parameters, the peaked tests
performed best, with bias equal to .032 for the 5-item test and
increasing to .155 for the 25-item test. Bias in the mean b's was
somewhat larger for the other two types of anchor tests, although
there were fewer differences between them. For the normal and
rectangular tests, the bias figures fell between .143 and .250. All
but one of these values were greater than the .146 observed in the
no-linking condition. Only the 25-item peaked test exceeded this
value.

The standard deviations of the b parameters were consistently 
underestimated by the peaked tests; bias was as high as -.332 for the 
5-item test, but was only -.005 for the 25-item test. Bias values 
for the other two types of tests were essentially the same, with a 
slight advantage going to the normal test at the longer test lengths. 
In the no-linking condition, bias in the standard deviation of the b 
parameters was .246, which was exceeded by all but the shortest normal 
and rectangular tests and the two longest peaked tests.

The patterns of mean absolute and root-mean-square errors in the
a and b parameters in the heterogeneous condition were identical to
what was observed in the homogeneous condition. In terms of mean
absolute error, the peaked anchor tests performed most poorly, with
errors ranging from .628 to 1.161 for the a parameter. Again, the
rectangular tests were best overall, with the normal tests closely
following. When no linking was performed at all, mean absolute error
for the a parameter was .455. All three peaked tests exceeded this
value, but only the 5-item normal and rectangular tests did. This
pattern of the absolute errors was repeated for the root-mean-square
errors.

The pattern of errors in the b parameters for the heterogeneous
case paralleled that observed in the b parameters for the homogeneous
case. Overall, there were essentially no differences among the anchor
test types in mean absolute error; all values were below the .368
observed in the no-linking condition. For the peaked tests, mean
absolute errors decreased with anchor test size as expected. For the
rectangular and normal tests, however, these errors increased with
test size, as was observed for the bias statistics.



The peaked tests were better, in general, than the other two
kinds of tests in terms of root-mean-square error for the b
parameters. These values ranged from .341 to .431 and, although there
was no trend observed with respect to anchor test size, all these
values were below the .466 observed in the no-linking condition. The
normal tests were slightly superior to the rectangular tests in terms
of root-mean-square error. In both cases, errors increased with
increased test length.

Small differences were observed across anchor tests in terms of 
the correlations between the true and estimated item parameters. For 
the a parameters, these values clustered between .511 and .515, with 
the lowest correlations observed for the peaked tests. All these 
correlations were higher than the .484 observed in the no-linking 
condition. There were no systematic trends observed with anchor test 
size. 

For the b parameters, these correlations were between .973 and 
.975, with the lowest correlations again observed for the peaked tests. 
All these correlations were higher than the .971 observed in the no- 
linking condition. 

Characteristics of asymptotic ability estimates. Table 56 presents
the summary characteristics of asymptotic ability estimates for the
homogeneous case. Columns 1 and 2 present the mean and standard
deviation of the asymptotic ability metric. The peaked tests came
closest to producing an ability metric with a mean of zero; this
value increased with increased test lengths. There were essentially
no differences observed between the normal and rectangular tests.
For the normal tests, the means also increased with increased test
length; for the rectangular tests, the means decreased.

The peaked tests performed most poorly in producing ability
estimates with a standard deviation of 1.0. The rectangular tests
produced estimates with a standard deviation closest to 1.0. For all
three types of anchor tests, the standard deviation increased with
increased test length.

The no-linking condition produced estimates whose mean, .003,
was closer to zero than were the means from any of the nine anchor
tests. The standard deviation for the no-linking condition, .970,
was exceeded only by the 25-item normal and rectangular tests.

Although the estimates from the peaked tests had means closer to
zero than did the other anchor tests, the peaked test estimates had
the highest mean absolute errors. The rectangular tests had the
smallest errors, but the errors for the normal tests were only
slightly larger. Errors for all three peaked tests exceeded the value
of .125 observed in the no-linking condition. Only the 5-item normal
and rectangular tests exceeded this value. In all cases, mean
absolute error decreased with increased test length. The pattern for
the root-mean-square errors in ability estimates was identical to
that observed for the mean absolute error.

Table 56. Asymptotic Ability Estimates: Anchor Tests
Homogeneous Condition Using Systematically Sampled Examinees

                                        Absolute       RMS
Method                  Mean        SD     Error     Error         R

Normal 5                .089      .745      .217      .285      .996
Normal 15               .091      .955      .095      .144      .996
Normal 25               .092      .971      .091      .140      .996
Rectangular 5           .093      .809      .170      .233      .996
Rectangular 15          .093      .955      .097      .146      .996
Rectangular 25          .086      .985      .089      .135      .996
Peaked 5                .043      .601      .321      .410      .996
Peaked 15               .062      .729      .225      .292      .996
Peaked 25               .081      .786      .184      .247      .996
No Linking              .003      .970      .125      .162      .996



The correlations between true and asymptotic ability were uni- 
formly .996 for the nine anchor tests, which is the same value ob- 
served when no linking was performed. 
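One property that may help in reading this result: a Pearson correlation is unchanged by any linear rescaling with a positive slope, whereas bias and root-mean-square error are not. The sketch below illustrates that property with made-up numbers. It is offered as a possible contributing explanation, not as one stated in this report; when several item sets are rescaled with different constants and then pooled, the correlation can still shift slightly. The function names and the constants k and m are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def rms_error(true_vals, est_vals):
    """Root-mean-square difference between estimates and true values."""
    n = len(true_vals)
    return math.sqrt(sum((e - t) ** 2
                         for t, e in zip(true_vals, est_vals)) / n)


# Made-up true abilities and unlinked estimates.
true_theta = [-1.5, -0.5, 0.0, 0.7, 1.3]
est_theta = [-1.2, -0.6, 0.1, 0.5, 1.4]

# Hypothetical linking constants applied to every estimate.
k, m = 0.8, 0.1
linked = [k * t + m for t in est_theta]

r_before = pearson_r(true_theta, est_theta)
r_after = pearson_r(true_theta, linked)
rms_before = rms_error(true_theta, est_theta)
rms_after = rms_error(true_theta, linked)
```

In this example the correlation is identical before and after rescaling, while the root-mean-square error changes, which mirrors the contrast between the R column and the error columns in the tables.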

The summary characteristics of the asymptotic ability estimates
for the heterogeneous case are presented in Table 57. These summary
statistics had much the same pattern as those of the homogeneous
case. As in the homogeneous case, the peaked tests produced estimates
with means closer to zero than did the other anchor tests; these means
increased with increased test length. The means for the normal and
rectangular tests were essentially the same, and clustered between
.083 and .090; they did not vary systematically with test size. The
standard deviations of ability estimates were smallest for the peaked
tests. They were closest to 1.0 for the rectangular tests, although
the standard deviations for the normal tests were only slightly lower.

The no-linking condition produced estimates with a mean of
-.013, closer to zero than any of the anchor tests. The standard
deviation of estimates from the no-linking condition was .962; this
was exceeded only by the 25-item normal and rectangular tests.

Table 57. Asymptotic Ability Estimates: Anchor Tests
Heterogeneous Condition Using Systematically Sampled Examinees

                                        Absolute       RMS
Method                  Mean        SD     Error     Error         R

Normal 5                .086      .742      .216      .284      .996
Normal 15               .089      .951      .091      .136      .996
Normal 25               .089      .967      .091      .132      .996
Rectangular 5           .090      .806      .167      .231      .995
Rectangular 15          .090      .951      .092      .138      .996
Rectangular 25          .083      .982      .085      .126      .996
Peaked 5                .041      .598      .325      .411      .996
Peaked 15               .060      .726      .226      .292      .996
Peaked 25               .079      .782      .183      .245      .996
No Linking             -.013      .962      .095      .127      .995

As before, the peaked tests performed most poorly in terms of
mean absolute error, with values ranging from .183 to .325. The
rectangular test performed slightly better than the normal test,
although differences were small at the longer test lengths. At test
lengths of 15 or larger, mean absolute error was less than .092 for
both the normal and rectangular tests; these were the only tests with
mean absolute error below the .095 observed for the no-linking
condition. Mean absolute error decreased with increased test length.

The pattern for root-mean-square error was similar. The peaked
tests performed most poorly, with root-mean-square error from .245 to
.411. The rectangular tests performed only slightly better than the
normal tests, particularly at the longer test lengths. Under the
no-linking condition, root-mean-square error was .127, which was
exceeded by all tests except the 25-item rectangular test.

The correlation between true and asymptotic ability was .996 in
all cases but one; when no linking was done, this correlation was .995.

Efficiency of ability estimation. The relative efficiencies of
the various anchor test linking procedures for the homogeneous case
are presented in Table 58. The average item information with the
true item parameters was .314. This dropped to .278 with the
estimated item parameters and, hypothetically, perfect linking.
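For readers unfamiliar with item information, the sketch below computes it for a logistic item and averages it over a set of items and ability levels. It is an illustration only: it assumes a three-parameter logistic model with guessing parameter c (c = 0 gives the two-parameter case), a scaling constant D = 1.7, and a simple unweighted average over a hypothetical grid of ability levels. This section does not state how the averages in Table 58 were formed, so the averaging step in particular is an assumption, and the item values are made up.

```python
import math

D = 1.7  # assumed logistic scaling constant


def p_correct(theta, a, b, c=0.0):
    """Probability of a correct response under a logistic item model."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def item_information(theta, a, b, c=0.0):
    """Item information at ability theta for a logistic item."""
    p = p_correct(theta, a, b, c)
    return (D * a) ** 2 * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2


def average_information(items, thetas):
    """Unweighted mean of item information over items and abilities."""
    total = sum(item_information(t, a, b, c)
                for (a, b, c) in items for t in thetas)
    return total / (len(items) * len(thetas))


# Made-up (a, b, c) triples and a hypothetical grid of ability levels.
items = [(1.0, 0.0, 0.0), (1.5, 0.5, 0.2)]
thetas = [-1.0, 0.0, 1.0]
avg_info = average_information(items, thetas)
```

Information grows with the square of the discrimination parameter and shrinks as guessing increases, so errors in the estimated a parameters feed directly into the information an adaptive test appears to deliver.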



Table 58. Efficiency Analysis: Anchor Tests
Homogeneous Condition Using Systematically Sampled Examinees

                      Average       Efficiency Relative to
                       Item            True      Estimated
Method              Information     Parameters   Parameters

True Parameters        .314
Est. Parameters        .278            .887
Normal 5               .274            .875         .986
Normal 15              .275            .877         .988
Normal 25              .275            .877         .988
Rectangular 5          .274            .873         .984
Rectangular 15         .275            .876         .987
Rectangular 25         .275            .876         .987
Peaked 5               .274            .875         .986
Peaked 15              .275            .876         .987
Peaked 25              .275            .876         .987
No Linking             .266            .849         .957

The efficiencies of these linking methods, relative to that
achieved by using true parameters, clustered between .873 and .887,
with the highest figures observed for the normal tests. With respect
to the estimated parameters, the efficiencies of these anchor tests
ranged from .984 to .988, with the normal tests being slightly
superior to the rest. All these values were higher than the .957
observed in the no-linking condition.
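The tabled efficiencies appear consistent with a simple ratio of average item informations. This is an inference from the tabled values, not a definition given in this section, and the function name below is hypothetical. Using the rounded figures from Table 58:

```python
def relative_efficiency(avg_info, reference_info):
    """Ratio of average item information to a reference value."""
    return avg_info / reference_info


# Rounded average item information values taken from Table 58.
info_true = 0.314      # true item parameters
info_est = 0.278       # estimated parameters, perfect linking
info_no_link = 0.266   # estimated parameters, no linking

eff_est_vs_true = relative_efficiency(info_est, info_true)
eff_nolink_vs_est = relative_efficiency(info_no_link, info_est)
```

The two ratios come out near .885 and .957. The small difference from the tabled .887 is within what rounding of the three-decimal information values would produce, which suggests the report computed its efficiencies from unrounded figures.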



The relative efficiencies of the various anchor test linking 
procedures are presented in Table 59 for the heterogeneous case. The 
average item information with the true item parameters was .305. 
This dropped to .271 with the estimated item parameters and perfect 
linking. 



Table 59. Efficiency Analysis: Anchor Tests
Heterogeneous Condition Using Systematically Sampled Examinees

                      Average       Efficiency Relative to
                       Item            True      Estimated
Method              Information     Parameters   Parameters

True Parameters        .305
Est. Parameters        .271            .889
Normal 5               .261            .858         .965
Normal 15              .262            .860         .968
Normal 25              .262            .859         .967
Rectangular 5          .261            .855         .962
Rectangular 15         .261            .858         .966
Rectangular 25         .262            .859         .966
Peaked 5               .261            .857         .964
Peaked 15              .262            .858         .966
Peaked 25              .262            .859         .967
No Linking             .248            .814         .916



The efficiencies of these linking methods, relative to that 
achieved by using true item parameters, clustered between .855 and 
.860. Once again, slightly higher figures were observed for the
normal tests. With respect to the estimated parameters, the
efficiencies of these nine anchor tests ranged from .962 to .968, with
the normal tests being slightly superior to the rest. All these
values were higher than the .916 observed in the no-linking condition.

Results: Robust Maximum-Likelihood Scores

In addition to the Bayesian ability estimates which were computed
for all simulated examinees, maximum-likelihood estimates were
computed for the examinees included in the calibration groups of
1,000. Identical analyses of item parameter error, asymptotic ability
estimates, and efficiency were computed for these estimates for the
homogeneous condition. For direct comparison with the results
obtained using the Bayesian scores, summary statistics for the
Bayesian scores were recomputed using only the 1,000-examinee
calibration groups.

Fidelity of parameter estimation. Table 60 presents the combined
results of item parameter error for the maximum-likelihood and
Bayesian scores. For the maximum-likelihood scores, biases in the
means of the a parameters were largest for the peaked tests and small- 
est for the rectangular tests although, again, differences between the 
normal and rectangular tests were small. All of the anchor tests ex- 
cept for the shortest two peaked tests, yielded smaller (in absolute 
value) bias figures than did the no-linking condition. Bias in the 
mean of the a parameters decreased with increased test lengths for 
the peaked tests, but no trends were observed with test lengths for 
the other anchor tests. 

The bias in the standard deviation of the a parameters was of 
approximately the same magnitude for all three anchor test types, 
and showed no consistent trends with test lengths. The no-linking 
condition yielded a bias of .112, which was exceeded only by the 
5-item tests. 

With respect to the Bayesian scores, the largest bias in the 
mean of the a parameters was also observed for the peaked tests, the 
smallest bias for the rectangular tests. In general, bias figures 
were larger for the Bayesian scores. Biases for the standard devia- 
tions of the a parameters for the Bayesian scores, however, were of 
approximately the same magnitude as those observed for the maximum 
likelihood scores, although the maximum-likelihood scores yielded 
somewhat smaller bias for the peaked tests. 

For the maximum-likelihood scores, the biases in the means of the 
b parameters were largest for the peaked tests, with small differences 
between the normal and rectangular tests. All of the bias values were 
larger than the .147 observed in the no-linking condition, although 
they all decreased with increased test lengths. Biases in the stand- 
ard deviation of the b parameters were largest for the peaked tests, 
and again, there were only small differences between the normal and 
rectangular tests. These values decreased with increased test length, 
and all were greater than the .228 observed with no linking.



Table 60. Item Parameter Error: Anchor Tests
Homogeneous Condition Using Systematically Sampled Examinees

                        Bayesian                   Maximum Likelihood
                  Bias in        RMS              Bias in        RMS
Method          Mean      SD   Error       R    Mean      SD   Error       R

Normal 5
  a             .575    .264    .906    .493   -.035    .248    .822    .329
  b             .114   -.091    .338    .980    .453    .599    .962    .946
Normal 15
  a             .101    .100    .586    .489   -.003    .069    .594    .472
  b             .217    .258    .506    .980    .232    .353    .535    .981
Normal 25
  a             .073    .089    .578    .489    .045    .081    .606    .479
  b             .222    .281    .517    .980    .217    .300    .488    .982
Rect. 5
  a             .399    .202    .767    .491    .050    .191    .687    .423
  b             .149    .018    .350    .980    .285    .439    .740    .955
Rect. 15
  a             .095    .096    .58?    .492   -.022    .066    .606    .474
  b             .219    .260    .497    .980    .249    .381    .560    .981
Rect. 25
  a             .043    .080    .566    .491    .037    .079    .598    .479
  b             .227    .314    .544    .980    .213    .308    .490    .982
Peaked 5
  a            1.087    .447   1.384    .496  -1.047   -.185   1.182    .319
  b            -.007   -.324    .419    .980   1.964   4.508   5.075    .954
Peaked 15
  a             .620    .281    .945    .494   -.688   -.075    .880    .370
  b             .072   -.116    .328    .980   1.100   2.050   2.508    .943
Peaked 25
  a             .457    .226    .811    .492    .017    .074    .599    .467
  b             .123   -.017    .348    .980    .337    .327    .583    .980
No Linking
  a             .143    .112    .629    .501    .143    .112    .629    .501
  b             .147    .228    .444    .973    .147    .228    .444    .973

Biases for the Bayesian scores were smaller, in general, than
they were for the maximum-likelihood scores. They tended to increase
with increased test lengths, and approximately half were smaller than
the values observed with no linking.

For the maximum-likelihood scores, root-mean-square error in the 
a parameters was largest for peaked tests. The advantage of the rec- 
tangular tests was slight. There was no consistent trend with test 
length; about half of the values were smaller than the value of .629 
observed with no-linking. 

This same pattern of root-mean-square errors in the a parameters 
was observed for the Bayesian scores, and the magnitude of the errors 
was approximately the same for the two scoring methods.

Root-mean-square errors in the b parameters for the
maximum-likelihood scores were largest for the peaked tests, and the
normal and rectangular tests performed equally well. There was a
strong tendency for the root-mean-square error to decrease with
increased test length, although all values were larger than the .444
observed with no linking.

For the Bayesian scores, root-mean-square errors increased with 
test length for the normal and rectangular tests; the magnitude of 
the errors was much smaller for the Bayesian scores than for the max- 
imum-likelihood scores. 

The correlations between the true and estimated a parameters 
were smallest for the peaked tests and largest for the rectangular 
tests when using the maximum-likelihood scores. When the Bayesian 
scores were used, all the anchor tests produced correlations which 
were of approximately the same magnitude, and consistently higher 
than those observed for the maximum-likelihood scores. 

For the maximum-likelihood scores, the correlations between true 
and estimated b parameters were of about the same magnitude for all 
the anchor tests, with the 15-item peaked test performing worse than 
would otherwise have been expected. For the Bayesian scores, these 
correlations were uniformly .980 for all nine anchor tests.

Characteristics of asymptotic ability estimates . Table 61 pre- 
sents the summary statistics for the asymptotic ability estimates with 
maximum-likelihood and Bayesian scoring. When maximum-likelihood 
scores were used, the 5-item normal and all of the peaked anchor tests 
produced means somewhat deviant from zero. The remaining anchor tests 
produced means near .1. The no-linking procedure produced a mean of
.034, better than that produced by any of the linking procedures.

The linking procedures did a better job of producing estimates
with a mean of zero when these estimates were scores computed with a
modal Bayesian algorithm. No mean was larger than .093. This was not
surprising since the Bayesian algorithm explicitly regressed estimates
toward zero. Again, there were but slight differences between the
normal and rectangular tests. This time, however, the peaked tests
performed best, with means between .040 and .079. Even these,
however, were still larger than that obtained by not linking at all.
Neither data set revealed a trend toward decreasing means with
increased test length.

Table 61. Asymptotic Ability Estimates: Anchor Tests
Heterogeneous Condition Using Systematically Sampled Examinees

                        Bayesian                   Maximum Likelihood
                                 RMS                              RMS
Method          Mean      SD   Error       R    Mean      SD   Error       R

Normal 5        .092    .739    .290    .996    .225   1.027    .285    .995
Normal 15       .091    .945    .143    .996    .098   1.023    .147    .996
Normal 25       .092    .962    .138    .996    .098    .995    .147    .996
Rect. 5         .092    .805    .235    .996    .107       ?       ?       ?
Rect. 15        .093    .950    .143    .996    .117   1.044    .172    .996
Rect. 25        .084    .979    .130    .996    .088    .997    .138    .996
Peaked 5        .040    .597    .412    .996    .259   2.694   1.781    .980
Peaked 15       .058    .723    .295    .996    .49?   1.796    .956    .997
Peaked 25       .079    .780    .249    .996    .204   1.008    .233    .996
No Linking      .034    .962    .133    .996    .034    .962    .133    .996
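The regression toward zero mentioned above can be illustrated by comparing a maximum-likelihood estimate with a posterior-mode (modal Bayesian) estimate for a single response pattern. The sketch below is not the scoring program used in this research. It assumes a two-parameter logistic model with D = 1.7, a standard normal prior on ability, ordinary (not robust) maximum likelihood, and a coarse grid search in place of an iterative solver; the item parameters and response pattern are made up.

```python
import math

D = 1.7  # assumed logistic scaling constant


def log_likelihood(theta, items, responses):
    """Log-likelihood of a 0/1 response pattern at ability theta."""
    total = 0.0
    for (a, b), u in zip(items, responses):
        p = 1.0 / (1.0 + math.exp(-D * a * (theta - b)))
        total += math.log(p) if u == 1 else math.log(1.0 - p)
    return total


def grid_mode(objective, lo=-4.0, hi=4.0, step=0.01):
    """Return the grid point in [lo, hi] that maximizes the objective."""
    n = int(round((hi - lo) / step))
    grid = [lo + i * step for i in range(n + 1)]
    return max(grid, key=objective)


# Five hypothetical items (a, b) and one examinee's responses.
items = [(1.5, -1.0), (1.5, -0.5), (1.5, 0.0), (1.5, 0.5), (1.5, 1.0)]
responses = [1, 1, 1, 1, 0]

# Maximum-likelihood estimate: maximize the likelihood alone.
ml_est = grid_mode(lambda t: log_likelihood(t, items, responses))

# Modal Bayesian estimate: add the log of a N(0, 1) prior density
# (up to a constant) before maximizing.
map_est = grid_mode(
    lambda t: log_likelihood(t, items, responses) - 0.5 * t * t
)
```

For this pattern both estimates are positive, and the modal Bayesian estimate lies closer to zero than the maximum-likelihood estimate. Applied across a whole group, that pull toward the prior mean yields estimates that are less variable, which is the pattern reported for the Bayesian standard deviations.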

The normal and rectangular tests, coupled with maximum-likelihood
scoring, produced estimates whose standard deviations were close to
1.0, typically between .965 and 1.044, with slightly "better"
estimates produced using the normal tests. The peaked tests produced
estimates with standard deviations quite large, at least for the 5-
and 15-item tests. The longest peaked test, and all the normal and
rectangular tests, produced estimates with standard deviations closer
to 1.0 than was observed with no linking.

With the Bayesian scores, ability estimates were systematically
less variable, as would be expected from a procedure which regressed
all estimates away from the extremes. The peaked test produced
estimates less variable than the others; no standard deviation here
was greater than .780. Although the differences were minor, the
rectangular test produced estimates with standard deviations closer
to 1.0 than did the normal test. Still, the no-linking value of .962
was exceeded only by the 25-item rectangular test.

There were few differences between the scoring procedures in
terms of mean absolute and root-mean-square errors. For both
procedures, the normal and rectangular tests performed best, with a
slight advantage given to the rectangular test. Overall, the Bayesian
scores performed slightly better than did the maximum-likelihood
scores. In both cases, the peaked tests performed worst, although here
the difference was much more marked for the maximum-likelihood scores.
Only for the 25-item rectangular test with Bayesian scores did the
errors ever drop below the level observed with no linking.

All the correlations between true and estimated ability clustered
near .996 when Bayesian scoring was used. These correlations were
more variable with maximum-likelihood scoring and, for the peaked and
rectangular anchor tests, showed a slight decrease with increasing
anchor-test length.

Efficiency of ability estimation. Table 62 presents the efficiency
figures for the maximum-likelihood and Bayesian scores. For the
Bayesian estimates, average item information was essentially .267 for
all nine anchor test conditions. For the maximum-likelihood scores,
this level was not reached until the 15-item normal and rectangular
anchor tests were used; for the peaked test, 25 items were necessary.
For the Bayesian scoring, efficiencies were essentially the same for
the three anchor test types, and these values increased only slightly
with test length. All were above the level achieved in the no-linking
condition. For the maximum-likelihood scores, the efficiencies were
generally lower than for the Bayesian scores, even at the longest test
lengths. All of the 5-item tests performed poorly, as did the 15-item
peaked test. Efficiency, with respect to the estimated parameters,
increased with test length, but still half the tabulated entries were
below the value of .964 achieved with no linking.

Discussion 

The data on anchor-test linking methods can be summarized rather
briefly since there were several distinct trends with few exceptions.
In terms of parameter bias, the peaked tests performed most poorly,
often yielding large errors in parameter and ability estimation.
There were few consistent differences noted between the normal and
rectangular tests, especially for longer tests, although at the
shorter test lengths, the rectangular test was usually superior.
Differences among the test types tended to fade when the criterion
was no longer bias but was the correlation between true and estimated
parameters or true and estimated ability. Differences among the test
types also disappeared when their relative efficiencies were taken as
the criterion.

Table 62. Efficiency Analysis: Anchor Tests
Homogeneous Condition Using Systematically Sampled Examinees

                         Bayesian              Maximum Likelihood
                           Efficiencies                Efficiencies
                            Relative to                 Relative to
                Avg. Item     True     Est. Avg. Item     True     Est.
Method              Info.  Params.  Params.     Info.  Params.  Params.

True Params.         .306                        .306
Est. Params.         .270     .882               .270     .882
Normal 5             .267     .872     .989      .235     .770     .873
Normal 15            .267     .873     .990      .266     .870     .987
Normal 25            .267     .871     .992      .267     .872     .989
Rect. 5              .266     .871     .988      .254     .831     .943
Rect. 15             .267     .873     .990      .265     .867     .983
Rect. 25             .267     .873     .990      .266     .870     .986
Peaked 5             .267     .871     .988      .227     .741     .841
Peaked 15            .267     .873     .991      .249     .813     .922
Peaked 25            .267     .874     .991      .266     .869     .986
No Linking           .260     .850     .964      .260     .850     .964

Anchor test length was a salient factor when one investigated
the errors of a-parameter and ability estimation. Across test types,
there were only small differences observed between the 15- and the
25-item tests; the 5-item tests were typically much worse than the
others. The trend toward decreasing errors with increasing test
lengths was expected, but was observed only for the a parameters.
For the b parameters, this trend was reversed, with smaller errors
observed with the shorter tests. The test length effects disappeared
when correlations and efficiencies rather than biases and errors were
considered.

When comparisons were made between the Bayesian and the maximum- 
likelihood scores, the former were consistently better based on all 
the criteria used in this research. 



Conclusions 



Data presented in this section of the report provided the first
opportunity to compare all four linking methods. In an effort to
avoid confusion, only data relevant to the conclusions drawn are
presented. Since the parameter-error statistics bear little direct
relation to the utility of the linked items, they will not be
discussed.

In terms of capacity to produce an asymptotic metric with the
correct mean, the anchor-group method was generally superior. In
nearly all configurations investigated, the anchor-group method
produced a mean correct to the second decimal place. The Bayesian
equivalent-tests method produced the most deviant mean. Asymptotic
means for each of the methods were essentially equivalent in the
homogeneous and heterogeneous conditions.

The most accurate asymptotic standard deviations were produced
by the anchor-test method. With a 25-item rectangular anchor test, it
produced an asymptotic standard deviation within .015 of the true
value. In less favorable configurations, however, it produced
standard deviations .4 unit in error. The equivalent-tests procedure
produced results nearly as good as the best anchor-test configuration.
The equivalent-groups and anchor-group procedures produced results
somewhat less accurate.

Using root-mean-square error as a composite error-of-metric
index, the anchor-group and anchor-test methods produced the least
error and were approximately equivalent. The equivalent-tests method
produced the most error.

Viewed in terms of linking efficiency, the anchor-test method
produced the most efficient item pools. Its efficiencies ranged from
.986 to .988 in the homogeneous condition and from .965 to .967 in
the heterogeneous condition. Configured properly, the anchor-group
procedure resulted in equivalent efficiencies, but with smaller groups,
the efficiency dropped somewhat. The equivalent-tests method produced
efficiencies slightly lower than the least efficient of the two anchor
procedures. The equivalent-groups method, whose assumptions were
violated by these data, produced efficiencies slightly lower than those
of the equivalent-tests procedure.

Although not considered in the previous discussion, the no-linking
condition should not be forgotten. In terms of errors in the
asymptotic distribution, it produced parameters as good as those
produced by the best of the other methods. Its efficiencies were
somewhat lower than those of the equivalent-groups procedure, however.

Use of the maximum-likelihood scoring procedure with the anchor- 
group or anchor-test procedures did not seem to be warranted by the 
data. In addition to producing less efficient item pools than did 
the Bayesian scoring procedure, this procedure appeared to bias the 
asymptotic metric more severely. Since it was investigated primarily 
as a means of reducing bias in the metric, these results suggest that
it is not a useful scoring procedure for linking in the environment 
investigated here. 

Neither of the anchor methods was evaluated in the randomly
sampled data set because their performance in that set was assumed to
be equivalent to their performance in the systematically sampled data
set. The same assumption was reasonable for the equivalent-tests
method, but that method was, nevertheless, evaluated in both sets and
thus provides a test of the assumption. In the systematically sampled
data set the equivalent-tests method produced parameters with
root-mean-square errors of .356 and .231 in the homogeneous and
heterogeneous conditions, respectively, and efficiencies of .971 and
.949. In the randomly selected data set, corresponding values were
.209, .143, .962, and .944. The asymptotic error statistics appeared
somewhat smaller in the randomly sampled condition, but the
efficiencies were comparable.

Efficiencies for the Bayesian equivalent-groups procedure were
.988 and .973 for the homogeneous and heterogeneous conditions,
respectively. These efficiencies compare very favorably with .988
and .968, the best efficiencies obtained by any method in the
systematically sampled data set. This suggests that, if examinees are
randomly sampled from the population of interest, the Bayesian
equivalent-groups procedure can produce item pools as efficient as
any of the more complicated methods.


VI. LINKING WHEN EXAMINEES ARE SELECTED



Investigations of linking discussed in previous chapters were 
limited to populations that could, more or less, occur in nature. No 
explicit selection had been done in defining the population and the 
distributions of abilities were essentially symmetric. The research 
discussed in this section of the report dealt with a selected
population. The examinee samples used were those of the selected data set
described in an earlier section. Briefly, the upper two-thirds of a 
sample were selected, on the basis of number-correct scores, to simu- 
late selection that occurs in Air Force recruits. The procedure was 
very similar to that used by Ree (1978). 
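
The selection step described above can be sketched in a few lines. The
example is only an illustration under stated assumptions: it draws
abilities from a standard normal distribution, generates number-correct
scores from a one-parameter logistic model with evenly spaced item
difficulties, and keeps the upper two-thirds of examinees by score. The
response model, the 40-item test length, the sample size, and the
function name `simulate_selection` are hypothetical choices made for the
sketch, not details taken from this report or from Ree (1978).

```python
import math
import random
import statistics


def simulate_selection(n_examinees=3000, n_items=40, keep_fraction=2 / 3, seed=0):
    """Return true abilities of the examinees kept after score-based selection."""
    rng = random.Random(seed)
    # Hypothetical item difficulties, spread evenly from -2 to +2.
    difficulties = [-2.0 + 4.0 * j / (n_items - 1) for j in range(n_items)]
    examinees = []
    for _ in range(n_examinees):
        theta = rng.gauss(0.0, 1.0)
        # Number-correct score under an assumed one-parameter logistic model.
        score = sum(
            rng.random() < 1.0 / (1.0 + math.exp(-(theta - b)))
            for b in difficulties
        )
        examinees.append((score, theta))
    # Rank by observed score and keep the upper portion of the sample.
    examinees.sort(key=lambda pair: pair[0], reverse=True)
    n_keep = round(n_examinees * keep_fraction)
    return [theta for _, theta in examinees[:n_keep]]


selected = simulate_selection()
selected_mean = statistics.mean(selected)
selected_sd = statistics.pstdev(selected)
print(f"kept {len(selected)} examinees: mean={selected_mean:.3f}, sd={selected_sd:.3f}")
```

Under these assumptions the retained examinees have a mean ability above
zero and a standard deviation below one, which is the direction of
distortion the selected data set is meant to represent.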

The selected data set contained only one row of the matrix of 
test lengths and sample sizes corresponding to a sample size of 1,000. 
This restriction of the data set was done primarily to save computer 
costs since adequate data regarding the joint effects of test length 
and sample size had been collected and discussed in earlier sections 
of this paper. Since the entire matrix was not available, only the 
homogeneous analyses were done.

Equivalence Methods 

Procedure

The equivalence linking procedures used on the selected data set 
were similar in form to those used in previous sections; the same 
equations were used to perform the linking. Because of findings of 
previous sections, however, only the modal Bayesian scoring method 
was used for equivalent-groups linking. The remaining five linking 
methods were not used. The equivalent-tests and no-linking proce- 
dures were the same as before. 

Results 

Fidelity of parameter estimation. Table 63 presents
fidelity-of-estimation statistics for the homogeneous condition using selected
examinees. Columns one and two present means and standard deviations
of the true a and b parameters for the items used with the selected 
data set. As was the case with items used in previous data sets, no 
notable departures from the population values were observed. 

Biases in the parameter estimates are presented in columns three
and four. The a-parameter means were essentially unbiased for the
equivalent-tests and no-linking procedures. The a parameters were
underestimated by .335 units when the Bayesian equivalent-groups
procedure was used. The equivalent-tests procedure produced b parameters
with nearly the correct mean. The other two procedures produced
underestimates of the b parameters.

Table 63. Item Parameter Error — Equivalence Methods
Homogeneous Condition Using Selected Examinees

                        True            Bias in       Absolute  RMS
Method              Mean    SD       Mean     SD      Error     Error    R

Equiv. Groups
  a                 1.601   .501     -.335   -.008    .476      .624    .466
  b                  .176  1.340     -.530    .843    .893     1.102    .974
Equivalent Tests
  a                 1.601   .501     -.015    .112    .444      .589    .458
  b                  .176  1.340      .051    .390    .456      .622    .968
No Linking
  a                 1.601   .501     -.015    .112    .491      .651    .465
  b                  .176  1.340     -.378    .400    .522      .657    .975

The Bayesian equivalent-groups procedure produced a parameters
with nearly the correct standard deviation. Standard deviations of
the a parameters were slightly greater than the correct values for the
other two methods. All linking procedures produced b-parameter
standard deviations that were larger than those of the true parameters.
The equivalent-groups procedure produced the largest standard
deviations.

Columns five and six present absolute and root-mean-square 
errors of parameter estimation. Errors in a-parameter estimates were 
approximately equal for all methods. The equivalent-tests method 
produced the least error and the no-linking procedure produced the 
most. Errors in the b parameters were about equal for the
equivalent-tests and no-linking procedures. The equivalent-groups
procedure produced b-parameter errors substantially greater than those
produced by the other procedures. 

Correlations between true and estimated parameters are presented
in the last column of the table. The equivalent-groups and no-linking
procedures were trivially different in terms of this correlation.
The equivalent-tests procedure produced correlations somewhat lower 
than the other two procedures. 
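
The fidelity statistics discussed above can be illustrated with a small
sketch. The report's own formulas are given in an earlier section and
are not reproduced here, so the definitions below are assumptions: bias
is taken as the difference between the estimated and true summary
statistic, the standard deviation is the population form, and R is the
Pearson correlation. The function name `fidelity_statistics` and the
sample parameter values are hypothetical.

```python
import math


def fidelity_statistics(true_values, estimates):
    """Summarize how closely parameter estimates track their true values."""
    n = len(true_values)
    if n != len(estimates) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")

    def mean(xs):
        return sum(xs) / len(xs)

    def sd(xs):
        m = mean(xs)
        return math.sqrt(sum((x - m) ** 2 for x in xs) / len(xs))

    errors = [e - t for t, e in zip(true_values, estimates)]
    mt, me = mean(true_values), mean(estimates)
    covariance = mean([(t - mt) * (e - me) for t, e in zip(true_values, estimates)])
    return {
        "bias_mean": me - mt,                        # bias in the mean
        "bias_sd": sd(estimates) - sd(true_values),  # bias in the SD
        "absolute_error": mean([abs(x) for x in errors]),
        "rms_error": math.sqrt(mean([x * x for x in errors])),
        "r": covariance / (sd(true_values) * sd(estimates)),
    }


# Hypothetical b parameters: every estimate is shifted downward by 0.4.
true_b = [-1.0, 0.0, 0.5, 2.0]
est_b = [b - 0.4 for b in true_b]
stats = fidelity_statistics(true_b, est_b)
print(stats)
```

A constant shift of this kind produces bias and error of equal size but
leaves the correlation at 1, which is why the R column can look
excellent for a method whose b-parameter mean is badly off.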

Characteristics of asymptotic ability estimates. Table 64 presents
statistics descriptive of asymptotic ability estimates. These
statistics should be interpreted relative to a standard normal
population even though the items were calibrated on a population
distinctly different. The first column presents asymptotic means
resulting from application of the items to a standard normal
population. All procedures resulted in net underestimates of
abilities. The equivalent-tests procedure produced the mean closest
to the true value of zero, and the equivalent-groups procedure
produced the one most deviant.

Table 64. Asymptotic Ability Estimates — Equivalence Methods
Homogeneous Condition Using Selected Examinees

                                        Absolute  RMS
Method              Mean     SD         Error     Error     R

Equiv. Groups       -.813    1.565      .823      1.000     .996
Equivalent Tests    -.156    1.250      .265       .369     .996
No Linking          -.566    1.265      .566       .642     .996

Asymptotic standard deviations are presented in the second
column. All three linking procedures produced estimates that were
quite deviant from the true value of 1.0. The equivalent-groups
procedure produced the most deviant estimates, however, and the other
two methods produced estimates about equally deviant.

Absolute and root-mean-square errors of the asymptotic estimates 
are presented in columns three and four. The equivalent-tests proce- 
dure produced the least error, according to both statistics, and the 
equivalent-groups procedure produced the most error. 

Column five presents correlations between true and asymptotic 
ability estimates. All three procedures resulted in correlations of 
.996, indicating that the regressions were about equally linear. 

Efficiency of ability estimation. Table 65 presents calibration
and linking efficiencies for the selected data set. As was true of
corresponding tables in previous sections, columns two and three are
simply manipulations of the data in column one, and column three is
most informative relative to linking efficiency. As can be seen from
column three, linking efficiencies of the equivalent-groups and
no-linking procedures were equal. The linking efficiency of the
equivalent-tests procedure was somewhat lower.

Linking efficiencies were quite high for all methods. These
figures are not, however, directly comparable to those from previous
data sets because these figures represent averages of only four cells
rather than the 12 represented in previous tables.

Table 65. Efficiency Analysis — Equivalence Methods
Homogeneous Condition Using Selected Examinees

                    Average        Efficiency Relative to
                    Item           True          Estimated
Method              Information    Parameters    Parameters

True Parameters     .325
Est. Parameters     .268           .824
Equiv. Groups       .265           .814          .988
Equivalent Tests    .262           .807          .979
No Linking          .265           .814          .988
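
The remark that columns two and three are manipulations of column one
can be checked with a short calculation. The sketch below assumes each
efficiency is the ratio of a method's average item information to the
information obtained with the true parameters (column two) or with the
estimated, perfectly linked parameters (column three). That reading is
an assumption, although it is consistent with the tabled values; the
function and variable names are hypothetical.

```python
def relative_efficiency(info, reference_info):
    """Ratio of average item information to a reference level."""
    return info / reference_info


# Average item information (column one of Table 65).
INFO_TRUE = 0.325       # true parameters
INFO_ESTIMATED = 0.268  # estimated parameters, perfectly linked
info_by_method = {
    "Equiv. Groups": 0.265,
    "Equivalent Tests": 0.262,
    "No Linking": 0.265,
}

# (efficiency relative to true, efficiency relative to estimated)
efficiencies = {
    method: (
        relative_efficiency(info, INFO_TRUE),
        relative_efficiency(info, INFO_ESTIMATED),
    )
    for method, info in info_by_method.items()
}

for method, (vs_true, vs_est) in efficiencies.items():
    print(f"{method:<17} {vs_true:.3f} {vs_est:.3f}")
```

Because the inputs are already rounded to three decimals, the recomputed
ratios agree with the tabled efficiencies only to within about .002; the
original figures were presumably computed from unrounded information
values.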



Anchor Group Method 



Procedure 



The anchor-group linking procedure used for the selected data 
set was essentially the same as that used for the systematically 
sampled data set. The modal Bayesian scoring procedure was used 
throughout this section, as the maximum-likelihood procedure demon- 
strated no distinct advantages in previous analyses. Details of the 
linking procedure were presented in the previous section and will not 
be repeated here.

Results 

Fidelity of parameter estimation. Table 66 presents parameter
error for the anchor-group design in the selected data set. Bias in
the estimates of the mean a parameter was positive for the normal
group (indicating overestimates) and slightly negative for the
uniform group (indicating underestimates). Bias tended to decrease
with increasing anchor group size for both normal and uniform groups.
Bias in the standard deviation of the a parameters showed the same
trends as the means. Bias tended to decrease with increasing anchor
group size and was smaller for the uniform group than for the normal
group. The no-linking condition very slightly underestimated the
a-parameter mean and showed less bias in the a-parameter standard
deviations than did any of the linking methods.

Table 66. Item Parameter Error — Anchor Groups
Homogeneous Condition Using Selected Examinees

                     True            Bias in       Absolute  RMS
Method           Mean    SD       Mean     SD      Error     Error    R

Normal 10
  a              1.601   .501      .220    .213    .536      .703    .466
  b               .176  1.340      .063    .182    .306      .429    .972
Normal 30
  a              1.601   .501      .181    .192    .517      .682    .464
  b               .176  1.340      .044    .205    .309      .429    .973
Normal 50
  a              1.601   .501      .163    .187    .505      .672    .465
  b               .176  1.340      .060    .221    .315      .434    .974
Normal 100
  a              1.601   .501      .144    .179    .503      .666    .467
  b               .176  1.340      .043    .243    .321      .440    .974
Uniform 10
  a              1.601   .501      .129    .184    .492      .657    .456
  b               .176  1.340      .030    .262    .348      .508    .972
Uniform 30
  a              1.601   .501     -.010    .125    .448      .601    .461
  b               .176  1.340      .065    .395    .425      .577    .974
Uniform 50
  a              1.601   .501     -.005    .123    .460      .609    .464
  b               .176  1.340      .057    .388    .417      .548    .974
Uniform 100
  a              (row illegible in source)
  b               .176  1.340      .055    .401    .425      .561    .974
No Linking
  a              1.601   .501     -.015    .112    .491      .651    .465
  b               .176  1.340     -.378    .400    .522      .657    .975

The biases in the means of the b parameters were very much alike
for both anchor groups, but the no-linking condition substantially
underestimated the mean. Bias in the standard deviation of the b
parameters revealed a tendency for increasing bias with increasing
anchor group size for both normal and uniform groups. The normal
group, however, showed smaller bias in standard deviation than the
uniform group, while the no-linking method had one of the largest
biases in standard deviation.

Absolute and root-mean-square error for the a parameter showed a 
decreasing trend with increasing anchor group size for the normal
groups. The uniform groups showed less error than the normal groups 
overall. The no-linking group showed errors midway between the uni- 
form and normal groups. 

Errors in the b parameters followed the opposite trends noted
for the a-parameter errors: errors increased with increasing anchor
group size and error was greater for uniform groups than for normal
groups. The no-linking group showed the greatest b-parameter error.

Correlations between true and estimated parameters tended to in- 
crease with increasing anchor group size and to be somewhat higher in 
the normal groups than in the uniform groups for the a parameter. 
For the b parameters, there were negligible differences between the
groups. The correlation between true and estimated a parameters in
the no-linking group was comparable to that observed in the normal
and uniform groups and the b-parameter correlation in the no-linking 
group was the highest of all groups. 

Characteristics of asymptotic ability estimates. Table 67 presents
descriptive statistics for asymptotic ability estimates for
each anchor group in the selected data set. Column one, showing the
means, indicates that parameters linked using normal or using uniform
anchor groups tended to underestimate the population mean of zero.
The normal groups appeared to have closer estimates than the uniform
groups over all group sizes, while the no-linking condition showed
the greatest deviation from zero. There were no apparent trends
with respect to increasing anchor group size.

Standard deviations were somewhat higher than the population
value of 1.0 and showed a trend for increasing values as the anchor
group size increased. The normal groups produced standard deviations
closer to 1.0 than did the uniform groups, and the no-linking
condition produced the largest standard deviation.

Absolute and root-mean-square error, presented in columns three
and four, showed a tendency to increase with increasing anchor group
size and to be larger for uniform than for normal groups. No-linking
produced the largest errors.

There were no differences across group composition or group size
in terms of the correlation of the true with the asymptotic ability
estimates. All correlations, including the no-linking group, were
uniformly .996.

Table 67. Asymptotic Ability Estimates — Anchor Groups
Homogeneous Condition Using Selected Examinees

                                   Absolute  RMS
Method           Mean     SD       Error     Error     R

Normal 10        -.084    1.081    .119      .161      .996
Normal 30        -.109    1.111    .130      .185      .996
Normal 50        -.094    1.118    .128      .181      .996
Normal 100       -.11?    1.131    .143      .203      .996
Uniform 10       -.143    1.146    .168      .236      .996
Uniform 30       -.130    1.241    .217      .295      .996
Uniform 50       -.136    1.236    .217      .294      .996
Uniform 100      -.138    1.244    .222      .299      .996
No Linking       -.566    1.265    .566      .642      .996

Efficiency of ability estimation. Table 68 presents the average
item information and relative efficiencies for the anchor-group
linking method. The efficiencies relative to the estimated parameters,
shown in column three, revealed a slight tendency to increase as
anchor group size increased. The normal groups showed an almost
trivial advantage over the uniform groups, while the no-linking
condition showed the highest efficiency.



Discussion

Much of the information presented thus far has been less than
definitive. Different analyses suggested different interpretations.
Fidelity analyses, for example, suggested that anchor groups using a
uniform distribution yield less parameter error than those using a
normal distribution. Asymptotic ability statistics suggested that a
normally distributed sample yields results superior to those of a
uniform distribution. Efficiency analyses, on the other hand, showed
both normal and uniform anchor groups to have about the same
efficiency.

Table 68. Efficiency Analysis — Anchor Groups
Homogeneous Condition Using Selected Examinees

                    Average        Efficiency Relative to
                    Item           True          Estimated
Method              Information    Parameters    Parameters

True Parameters     .325
Est. Parameters     .268           .824
Normal 10           .263           .810          .983
Normal 30           .265           .813          .987
Normal 50           .265           .813          .987
Normal 100          .265           .813          .987
Uniform 10          .263           .809          .982
Uniform 30          .263           .810          .983
Uniform 50          .263           .810          .983
Uniform 100         .264           .812          .986
No Linking          .265           .814          .988

Results of the efficiency analysis for the anchor-groups proce- 
dure were especially noteworthy in view of the rather large discrep- 
ancy between the distributions of ability used in the anchor groups 
and those used in the calibration samples. The anchor groups had 
abilities with a mean of zero and a standard deviation of one. The
selected examinees in this data set had a mean greater than zero and 
a standard deviation less than one. 

Although the no-linking condition showed the highest efficiency, 
the b-parameter mean and asymptotic ability mean were quite deviant 
from their true values. The efficiency of the no-linking condition
did not reflect these deviant parameter estimates because efficiency
statistics, like correlations, are insensitive to linear
transformations of the data. If, however, an attempt was made
to link items calibrated on groups widely different in ability
(vertical equating), the no-linking procedure would show much lower
efficiencies because each set of items would tend to shift the scale
closer to its own metric.

As discussed earlier, efficiency analyses are the most appropriate
evaluative criteria to apply to the linking procedures. The
efficiency analyses suggested the following observations: (a) group 
composition tended to make very slight differences in observed 
efficiency, (b) there was a tendency for higher efficiency as test 
length increased and anchor group size increased, the latter being 
less pronounced than the former, and (c) increasing anchor group size 
did not substantially increase the efficiency. 
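
The point made above, that statistics such as correlations are
insensitive to linear transformations while bias and error statistics
are not, can be illustrated with a brief sketch. The example applies a
noiseless linear distortion to a set of hypothetical abilities and then
reverses it, as a perfect linear linking would. The slope and intercept
echo the no-linking standard deviation (1.265) and mean (-.566) reported
for this data set but are used purely as example values; the helper
names are hypothetical, and the sketch does not reproduce the report's
efficiency computation.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def rms_error(xs, ys):
    """Root-mean-square difference between two equal-length sequences."""
    return math.sqrt(sum((y - x) ** 2 for x, y in zip(xs, ys)) / len(xs))


# Hypothetical true abilities, symmetric about zero.
true_theta = [-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0]

# A linear distortion of the metric (example values only).
slope, intercept = 1.265, -0.566
distorted = [slope * t + intercept for t in true_theta]

# Undo the distortion, as a perfect linear linking would.
relinked = [(d - intercept) / slope for d in distorted]

r_distorted = pearson_r(true_theta, distorted)
r_relinked = pearson_r(true_theta, relinked)
rmse_distorted = rms_error(true_theta, distorted)
rmse_relinked = rms_error(true_theta, relinked)
print(r_distorted, r_relinked, rmse_distorted, rmse_relinked)
```

The correlation is the same before and after relinking, whereas the
root-mean-square error is large for the distorted metric and vanishes
once the distortion is removed. This mirrors the pattern in which the
no-linking condition shows high correlations and efficiencies alongside
deviant means.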



Anchor Test Method

Procedure

The anchor-test linking procedures used for the selected data
set presented in this section were identical to those used for the
randomly and the systematically sampled data sets. Details of these
linking procedures were presented earlier and will not be repeated
here. Analyses were performed only for the condition where the items
were originally calibrated on 1,000 cases for four different test
lengths. Only the homogeneous condition is presented here. Modal
Bayesian ability estimates were used throughout.

Results

Fidelity of parameter estimation. Fidelity-of-estimation
statistics for the homogeneous condition are presented in Table 69. All
of the anchor test procedures overestimated the a parameters, although
this bias systematically decreased with increased anchor-test lengths.
The smallest biases in the mean of the a parameters were observed for
the rectangular tests, although at the longer test lengths the normal
tests produced biases nearly as small. Much larger biases were
observed for the peaked tests at all three test lengths. When no
linking was performed on the data, bias in the mean of a parameters was
-.015. This figure was exceeded by all nine anchor test methods.

Biases in the standard deviations of the a parameters were
largest for the peaked tests. There were few differences observed in the
biases for the normal and rectangular tests. All the biases
systematically decreased with increased test length. In the no-linking
condition, bias in the standard deviation of the a parameters was
.112. This figure was exceeded by all nine anchor test methods.

All anchor test methods produced b-parameter estimates that were
essentially unbiased in their means. The largest bias observed,
-.082, was quite small. The no-linking group produced considerable
bias, by comparison. This was expected, however, as the mean ability
levels of the calibration groups were substantially above zero.

Table 69. Item Parameter Error — Anchor Tests
Homogeneous Condition Using Selected Examinees

                     True            Bias in       Absolute  RMS
Method           Mean    SD       Mean     SD      Error     Error    R

Normal 5
  a              1.601   .501      .617    .366     .794      .998   .466
  b               .176  1.340     -.030   -.088     .262      .353   .973
Normal 15
  a              1.601   .501      .181    .194     .514      .672   .466
  b               .176  1.340      .037    .219     .317      .450   .973
Normal 25
  a              1.601   .501      .156    .188     .506      .662   .467
  b               .176  1.340      .050    .241     .329      .464   .973
Rectangular 5
  a              1.601   .501      .552    .337     .744      .939   .466
  b               .176  1.340     -.007   -.054     .252      .344   .974
Rectangular 15
  a              1.601   .501      .188    .197     .518      .677   .466
  b               .176  1.340      .044    .211     .313      .445   .973
Rectangular 25
  a              1.601   .501      .123    .174     .493      .646   .467
  b               .176  1.340      .055    .273     .347      .489   .973
Peaked 5
  a              1.601   .501     1.192    .588    1.273     1.541   .465
  b               .176  1.340     -.082   -.346     .344      .462   .973
Peaked 15
  a              1.601   .501      .748    .416     .?96     1.??3   .465
  b               .176  1.340     -.033   -.157     .271      .367   .973
Peaked 25
  a              1.601   .501      .566    .345     .755      .951   .466
  b               .176  1.340     -.002   -.057     .257      .353   .973
No Linking
  a              1.601   .501     -.015    .112     .491      .651   .465
  b               .176  1.340     -.378    .400     .522      .657   .975

As was observed for the b-parameter means, all three peaked
tests underestimated the b-parameter standard deviations; this bias 
decreased with increased test length. Biases in the standard devia- 
tion of the b parameters were of approximately equal magnitude for 
the normal and rectangular tests. Except at the 5-item test lengths, 
this bias was positive; for both the normal and rectangular tests, 
bias increased with test length. All of the anchor tests produced 
biases smaller than that observed for the no-linking condition. 

Mean absolute and root-mean-square errors in the parameters
are presented in columns five and six of Table 69. The peaked
anchor tests performed most poorly according to both of these indices
of error for the a parameters. In general, errors for the
rectangular tests were smaller than for the normal tests although, as
before, these differences were small. Both indices of error
decreased with increased test length. In most cases, the no-linking
condition yielded smaller absolute and root-mean-square errors in
the a parameters than did any of the anchor test conditions.

Overall, the magnitude of absolute and root-mean-square errors
in the b parameters was approximately equivalent for all three types
of anchor tests. Both types of errors decreased with increased test
length for the peaked tests, but increased with test length for the
normal and rectangular tests. The no-linking procedure yielded
larger absolute and root-mean-square errors in the b parameters than
did any of the anchor-test methods.

The anchor-test-method correlations between true and estimated
a parameters clustered between .465 and .467; for the no-linking
condition, this value was .465. The anchor-test correlations for
the b parameters were almost uniformly .973 (the correlation for the
5-item rectangular test was .974), slightly lower than the value of
.975 observed with no linking.

Characteristics of asymptotic ability estimates. Table 70 presents
the summary characteristics of asymptotic ability estimates for
the homogeneous case. Columns one and two present the means and
standard deviations of the asymptotic ability metric. All of the
anchor tests produced means slightly below the targeted zero. None
of the three test types produced means consistently closest to zero
but the normal tests consistently produced means most deviant.
Differences among these means were small, however. Means consistently
decreased with test length for the rectangular tests and increased for
the others. The no-linking procedure produced a mean much more
deviant from zero than did any of the anchor-test methods.

All of the peaked tests produced ability estimates with standard
deviations less than 1.0. The 5-item normal and rectangular tests
did likewise. The longer normal and rectangular tests produced
estimates with standard deviations greater than 1.0. In all cases, the
standard deviations of ability estimates increased with anchor test
length. The standard deviation of the no-linking condition was 1.265,
a value further from 1.0 than was produced by any of the anchor tests.

Table 70. Asymptotic Ability Estimates — Anchor Tests
Homogeneous Condition Using Selected Examinees

                                     Absolute  RMS
Method             Mean     SD       Error     Error     R

Normal 5           -.117     .892    .135      .188      .996
Normal 15          -.115    1.111    .130      .195      .996
Normal 25          -.107    1.126    .133      .198      .996
Rectangular 5      -.102     .918    .115      .164      .996
Rectangular 15     -.107    1.105    .125      .186      .996
Rectangular 25     -.110    1.148    .146      .215      .996
Peaked 5           -.116     .709    .230      .325      .996
Peaked 15          -.106     .843    .145      .213      .996
Peaked 25          -.097     .913    .113      .165      .996
No Linking         -.566    1.265    .566      .642      .996

Mean absolute and root-mean-square errors in the ability metric
are presented in columns three and four of Table 70. The magnitude
of absolute error was approximately the same across the three types
of anchor tests, with a tendency for the smallest peaked test to
produce errors larger than the rest. Mean absolute errors increased
with test length for the rectangular tests, and decreased with test
length for the peaked tests. For the normal tests, these errors did
not vary systematically with test length. Mean absolute error in the
no-linking condition was much higher than that observed for any of
the anchor tests. Exactly the same patterns were observed for the
root-mean-square errors in the ability estimates.

The correlation between true and estimated ability was uniformly 
.996 for all the anchor tests and for the no-linking procedure. 

Efficiency of ability estimation. Information and the relative
efficiencies for the anchor-test procedures for the homogeneous case
are presented in Table 71. The average item information with the
true parameters was .325. This dropped to .268 with the estimated
parameters and, hypothetically, perfect linking. The average item
information with the anchor-test procedures and with no linking was
.265.



Table 71. Efficiency Analysis — Anchor Tests
Homogeneous Condition Using Selected Examinees

                    Average        Efficiency Relative to
                    Item           True          Estimated
Method              Information    Parameters    Parameters

True Parameters     .325
Est. Parameters     .268           .824
Normal 5            .264           .813          .987
Normal 15           .265           .815          .989
Normal 25           .265           .814          .988
Rectangular 5       .265           .815          .989
Rectangular 15      .265           .815          .989
Rectangular 25      .265           .814          .988
Peaked 5            .265           .814          .988
Peaked 15           .265           .814          .988
Peaked 25           .265           .814          .988
No Linking          .265           .814          .988


The efficiencies of these linking methods, relative to that
achieved by using true parameters, clustered between .813 and .815.
With no linking, the relative efficiency was .814. With respect to
the estimated parameters, the efficiencies of the anchor-test
procedures ranged from .987 to .989, with no overall difference
observed across anchor tests. The corresponding efficiency figure for
the no-linking condition was .988.



Discussion



Overall, the peaked anchor tests tended to perform most poorly
when errors in item parameters were taken as the criteria. There
were few differences observed between the normal and rectangular
tests but, when differences were found, they tended to favor the
rectangular tests. In most cases, the indices of bias decreased
with increased test length; the 15-item tests performed nearly as
well as the 25-item tests and better than the 5-item tests. There
were essentially no differences across anchor test types and test
lengths in the correlations between true and estimated item
parameters.

More relevant to the study of linking methods are the
characteristics of the asymptotic ability estimates produced by each
method. There were few differences observed across anchor test types
in terms of their ability to produce estimates with a mean of zero
and standard deviation of one, and in the absolute and
root-mean-square errors in these estimates. When differences were
found, they typically indicated that the peaked tests were somewhat
worse than the others. There were no consistent trends with test
length. The correlations between the true and estimated ability were
identical across all nine anchor tests.

Perhaps most important in this study, however, were the indices
of efficiency of the anchor test procedures. Essentially no
differences were found across anchor test types and test lengths; all
efficiency figures were between .987 and .989.



Conclusions 



Analyses presented in this section have been, in part, a
replication of analyses done on the randomly sampled examinees.
Examinees used in this section were randomly sampled from a single
population. The difference between these groups and those of the
previous data set was simply that the single population was redefined
as having been selected, and thus skewed in distribution.


Many of the findings with the selected sample paralleled those
of the randomly sampled data set. Specifically, equivalent-groups or
no-linking methods produced pools of items as efficient, in terms of
linking, as did the more complex anchoring methods. The
equivalent-tests method, as before, was inferior to the other methods.

The anchoring methods were far superior to the equivalence and
no-linking methods in reproducing the original standard ability
metric. This was simply due to the fact that only the anchoring
methods had information regarding the "correct" metric.

As a general conclusion, it appears that the equivalent-groups
method is simple and effective for linking sets of items if examinees
used in calibration are all sampled from a common population,
regardless of its shape. If, however, the original metric must be
reproduced, the equivalent-groups method has no way to reproduce it.
Mixing items calibrated on a selected group with items calibrated on
an unselected group would be one example where an original, or at
least a common, metric would need to be reproduced.



VII. PRACTICAL APPLICATIONS OF LINKING



Development of a Composite Approach 

The linking tasks the Armed Services must face in developing
adaptive-testing item pools can be reduced to two. First, the items
comprising the initial pool will be calibrated in several sets on
several groups and must be linked onto a common metric. Second, new
items will be added to the pool at later dates and must be linked
onto the same metric. Data presented in the preceding sections provide
good solutions to the first problem. These solutions will be
summarized below. Data presented in these sections provide some
solutions to the second problem. More complex solutions, however,
require further analyses. (See Appendix C for a summary of a meeting
with Air Force personnel in which the Armed Services linking problem
was discussed.)

The primary objective of linking is to produce a pool of items
that will function together efficiently. Efficiency of the method
is thus the most important criterion for choosing a method to link
the initial pool. Since norms will undoubtedly be constructed on
the basis of the metric of the initial pool, additional criteria must
be considered in choosing a method for linking future items to the
original pool. Specifically, addition of the new items should not
distort the original metric and, therefore, a method that produces
little distortion should be chosen. Hence, the asymptotic-estimate
criteria are also relevant to this linking problem. Discussion and
analyses presented below will be limited to these relevant criteria.

Linking the Initial Item Set— A Summary of Findings 

Given that the objective in calibrating and linking the initial
item pool is to obtain a set of items that function efficiently,
several methodological suggestions can be made. The
equivalent-groups linking method using modal Bayesian scoring works as
well as any of the more complicated linking procedures when examinees
are randomly sampled from a common population. If it is possible to
sample in this manner, there is no advantage to using a more
complicated procedure. The method worked about equally well at all
test lengths investigated. It exhibited a slight tendency toward
greater efficiency with larger examinee samples, but these findings
were inconsistent. The differences were not sufficiently consistent to
suggest whether 500, 1,000, or 2,000 examinees should be used; in
practice, the largest available sample would probably be used.

Analyses of calibration efficiency provided some guidance
regarding the sample size and test length necessary for item
calibration. Generally, larger samples and longer tests produced more
efficient parameter estimates. If a tradeoff could be made between
test length and sample size, however, these analyses suggested that
emphasis should be placed on increasing the test length, since
increases in test length were three to four times as effective as
proportionate increases in sample size.

In the Armed Services environment, it is conceivable that new
test items might be calibrated in conjunction with AFEES
administration of the current ASVAB. If the new items were to parallel
a subtest on the ASVAB, this subtest would be a potential anchor test,
but random distribution of experimental subtests across the AFEES
population would eliminate the need for an anchor test. Simultaneous
calibration of the new and old ASVAB items would, however, result in a
longer test and, therefore, better calibration, so the two tests
should be calibrated together, even if the ASVAB subtest is not used
for linking.

If random distribution were to prove impractical, the analyses 
of previous sections suggest that an anchoring method should be 
used. Either 100 anchor examinees or 15 to 25 anchor items would 
provide efficiency equivalent to that obtained by randomly sampling 
examinees. If the new items were to be administered concurrently 
with the ASVAB, the anchor-test method of linking would be an obvious 
choice. Previous analyses suggest that rectangular and normal anchor 
tests work about equally well. Each of the present ASVAB subtests has
an information curve which is similar to one of these two forms. 

Linking Across Time— Further Analyses 

An item pool, regardless of the care taken in its creation, is 
not likely to remain static forever. For a variety of reasons, new 
items will be added and old items will be removed during the life of
the item pool. These new items must be calibrated and linked onto 
the metric of the original items. 

Since the examinee population is likely to change over time,
the equivalent-groups procedure is not an appropriate method of
linking the new items to the old. The equivalent-tests procedure, even
if its assumptions could be met, would still be an inefficient
procedure. Given that individuals are likely to change over time, the
anchor-group procedure would not be appropriate.

The anchor-test method, if the anchor test remained constant, 
would be as efficient over time as it is at a single time. Therefore, 
it appears to be the method of choice for linking over time. If a 
constant anchor test can be maintained, linking over time will pro- 
duce no more difficulty than linking within a single time period. 

It is conceivable, however, to perform anchor-test linking
using several anchor tests over time. A current ASVAB subtest may be
used as an anchor test for new items. These new items may be used to
form a new ASVAB subtest. This new ASVAB subtest may then be used as
an anchor test for linking the second new set of items. Before this
cascading procedure is attempted, however, it is important that its
effects on efficiency and the ability metric be known. (This is
probably an oversimplification of the problem since future versions of
the ASVAB are likely to be adaptive. It provides a manageable model
for analysis, however, and should provide some insight into the
problem.)

Method. Item parameters and ability levels for a sample size of 
1000 and test lengths of 20, 35, 50, and 65 items were taken from the 
systematically sampled data set. This data set was chosen because 
each group within each of the four cells was sampled from a different 
population. This is analogous, to some extent, to what would happen 
if groups were sampled at different time periods. 

Within each cell, five calibration groups were arbitrarily 
ordered. The first group was linked, using the equivalent-groups 
procedure, to a standard (i.e., mean zero, variance one) population. 
(Note that this does not imply anchoring, and each initial group was 
linked to a different standard population.) Fifteen items were then 
selected from the test given to the first group as an anchor test. 
The first 15 were selected and, since the items in the tests were ran- 
domly ordered, represented a randomly sampled subset of items. These 
items were administered to the second calibration group and, using 
these items as an anchor test, the items in the second test were 
linked to the first. Fifteen items were selected from this linked 
second test and used to link the third test. This procedure was re- 
peated until the fifth test had been so linked. 
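
The cascading steps described above can be sketched in code. The
report does not state here which transformation was used to place one
test on the metric of another, so the sketch assumes a simple
mean/sigma linear transformation computed from the difficulty (b)
parameters of the anchor items; that choice, the function names, and
the example numbers are all hypothetical. The point of the sketch is
only that each new link is composed with the links before it, so any
error in an early link is carried into every later one.

```python
from statistics import mean, pstdev


def mean_sigma_link(anchor_b_new, anchor_b_ref):
    """Return (A, B) such that b_ref is approximately A * b_new + B.

    Assumed mean/sigma method: match the mean and standard deviation of
    the anchor-item difficulties on the two scales.
    """
    A = pstdev(anchor_b_ref) / pstdev(anchor_b_new)
    B = mean(anchor_b_ref) - A * mean(anchor_b_new)
    return A, B


def compose(outer, inner):
    """Compose two linear links: apply `inner` first, then `outer`."""
    A_o, B_o = outer
    A_i, B_i = inner
    return A_o * A_i, A_o * B_i + B_o


def cascade(links):
    """links[k] maps the scale of test k+2 onto the scale of test k+1.

    Returns, for each later test, the composite link onto test 1's scale.
    """
    total = (1.0, 0.0)  # identity transformation for the first test
    composites = []
    for link in links:
        total = compose(total, link)
        composites.append(total)
    return composites


def rescale_item(a, b, c, link):
    """Put three-parameter item values on the reference scale."""
    A, B = link
    return a / A, A * b + B, c  # the guessing parameter is unchanged


# Made-up anchor difficulties: the same three items as estimated in the
# second calibration (new scale) and in the first (reference scale).
link_2_to_1 = mean_sigma_link([-0.75, -0.25, 0.25], [-1.0, 0.0, 1.0])
link_3_to_2 = (0.5, 1.0)  # hypothetical link found at the next step
composites = cascade([link_2_to_1, link_3_to_2])
```

Here `composites[1]` is the transformation that carries items from the
third test directly onto the metric of the first test.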

Asymptotic-ability-estimate and efficiency statistics were then 
calculated. They were calculated on the first test alone and then 
on each of the remaining tests in combination with the first. Cumu- 
lative effects of linking could thus be observed as more new tests 
were cascaded upon the old. 

Although the modal Bayesian scoring procedure had proved superi- 
or to the maximum-likelihood procedure when a single anchor test was 
used, it was not obvious to what extent its inherent bias would 
affect linking in a cascaded environment. The
robust-maximum-likelihood procedure was thus additionally considered
as an unbiased procedure.

Results. Table 72 presents asymptotic-ability-estimate means
and standard deviations for cascaded linking using modal Bayesian
scoring. The level of linkage refers to the number of linkages
required to link back to the original test. Average errors represent
the average absolute deviation of the row or column entries from the
zero-level values. The zero-level values differ from each other
because no anchor method was used to anchor the first tests to any
common metric.

Table 72. Asymptotic Ability Metric
of Cascaded Tests: Modal Bayesian Scoring

             Level of            Test Length              Average
             Linkage       20      35      50      65     Error

Mean            0         .118    .488    .053    .154
                1         .052    .434    .064   -.032    .079
                2        -.152    .337    .047   -.048    .157
                3        -.028    .279   -.027   -.034    .156
                4         .116    .329   -.009    .073    .076
          Average Error   .121    .143    .040    .164    .117

Standard        0        1.136   1.189   1.089   1.194
Deviation       1        1.057   1.080    .936    .914    .155
                2         .893    .909    .912    .872    .256
                3         .854    .801    .943    .842    .292
                4         .949    .880    .918    .887    .244
          Average Error   .198    .272    .161    .315    .237

The most notable observation that can be made from the first
half of Table 72 is that there were no apparent trends in error with
increasing linkage distance at any of the four test lengths with
respect to the means. The column with the most deviant starting
value, .488, showed some tendency to drift toward zero but this trend
was not consistent.

The standard deviations exhibited a tendency to drop with the 
first one or two linkages. After that they appeared to stabilize at 
approximately .9. No differences in this tendency were apparent 
across the various test lengths. 
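
The "Average Error" entries of Table 72 can be reproduced from the
definition given above (the average absolute deviation of the row or
column entries from the zero-level values). The sketch below does this
for the means; the function name and data layout are illustrative
choices, and the numbers are the Table 72 means as printed.

```python
# Asymptotic means from Table 72 (modal Bayesian scoring).
# Rows are linkage levels 0-4; columns are test lengths 20, 35, 50, 65.
means = [
    [0.118, 0.488, 0.053, 0.154],     # level 0 (no linkage)
    [0.052, 0.434, 0.064, -0.032],    # level 1
    [-0.152, 0.337, 0.047, -0.048],   # level 2
    [-0.028, 0.279, -0.027, -0.034],  # level 3
    [0.116, 0.329, -0.009, 0.073],    # level 4
]


def average_errors(table):
    """Average absolute deviations from the zero-level (first) row."""
    baseline = table[0]
    deviations = [
        [abs(value - base) for value, base in zip(row, baseline)]
        for row in table[1:]
    ]
    # One average per linkage level (across test lengths).
    by_level = [sum(row) / len(row) for row in deviations]
    # One average per test length (across linkage levels).
    by_length = [sum(col) / len(col) for col in zip(*deviations)]
    # Grand average over every linked cell.
    overall = sum(sum(row) for row in deviations) / (
        len(deviations) * len(baseline)
    )
    return by_level, by_length, overall


by_level, by_length, overall = average_errors(means)
print([round(x, 3) for x in by_level])
print([round(x, 3) for x in by_length])
print(round(overall, 3))
```

Rounded to three decimals, the results agree with the row averages
(.079, .157, .156, .076), the column averages (.121, .143, .040,
.164), and the overall average (.117) printed in the table.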

Table 73 presents asymptotic-estimate means and standard
deviations for robust-maximum-likelihood scoring. Unlike the Bayesian
procedure, the maximum-likelihood procedure showed a slight tendency
to produce increasing means with increasingly distant linkages. This
tendency was inconsistent, however.

Table 73. Asymptotic Ability Metric
of Cascaded Tests: Maximum-Likelihood Scoring

             Level of            Test Length              Average
             Linkage       20      35      50      65     Error

Mean            0         .079    .406    .048    .103
                1         .061    .497    .070    .145    .043
                2         .120    .537    .06?    .163    .062
                3         .225    .592    .040    .210    .112
                4    (illegible)  .55?    .044    .247    .100
          Average Error   .075    .140    .012    .088    .079

Standard        0         .876    .951    .906   1.018
Deviation       1         .815   1.015    .945   1.121    .059
                2        1.009   1.026    .998   1.123    .101
                3        1.033   1.107   1.047   1.183    .167
                4         .995   1.073   1.038   1.232    .147
          Average Error   .123    .104    .101    .146    .119


Standard deviations, using the robust-maximum-likelihood
procedure, rose rather than fell. By the third linkage, they were
deviant from the initial values by .167, on the average. This dropped
to .147 by the fourth linkage and may be indicative of a stabilization.

Table 74 presents linkage efficiencies of the cascaded tests
using modal Bayesian scoring. No consistent trends in efficiency 
were observed. A slight inconsistent trend toward lower efficiency 
with increasing linkage distance and an inconsistent increasing trend 
with respect to test length were observed. The overall level of 
efficiency was somewhat lower than levels observed previously in the 
systematically sampled data set; efficiencies with Bayesian anchor-test 
linking using a constant anchor test were .970, compared to .929 here. 
It should be noted, however, that the conditions of linking were some- 
what different as five tests at a time were linked before, and only 
two at a time were linked here. 

Table 75 presents linkage efficiencies of the cascaded tests
using robust-maximum-likelihood scoring. A more definite decreasing
trend in efficiency with linkage distance was observed here than had
been observed using Bayesian scoring. An inconsistent increasing
trend with respect to test length was again observed. In general,
the maximum-likelihood scoring procedure produced somewhat more
efficient linkage than did the Bayesian procedure. Where the average
linking efficiency was .929 for the Bayesian procedure, it was .942
when maximum-likelihood scoring was used.

Table 74. Linkage Efficiency of
Cascaded Tests: Modal Bayesian Scoring

Level of            Test Length
Linkage       20      35      50      65     Average

   1         .943    .981    .983    .930     .959
   2         .874    .914    .954    .918     .915
   3         .895    .862    .969    .911     .909
   4         .958    .883    .959    .936     .934
Average      .918    .910    .966    .924     .929


Table 75. Linkage Efficiency of
Cascaded Tests: Maximum-Likelihood Scoring

Level of            Test Length
Linkage       20      35      50      65     Average

   1         .968    .962    .993    .972     .974
   2         .972    .923    .989    .965     .962
   3         .865    .892    .967    .940     .917
   4         .920    .911    .972    .863     .917
Average      .931    .922    .980    .935     .942

Discussion. Linking using cascaded anchor tests with Bayesian
scoring did not exhibit any substantial tendencies toward decreasing
efficiencies with increasing linkage distances. Slightly more
consistent tendencies toward lowered efficiency were observed with
maximum-likelihood scoring. Maximum-likelihood scoring produced
slightly higher average efficiency than did Bayesian scoring across
the conditions investigated. Slight trends in bias were observed with
respect to asymptotic standard deviations using either method but
none were observed with respect to means or efficiencies.



It should be noted that no trends were built into the true
abilities used in this simulation. Abilities of each group were
different but not in any predictable fashion. If trends were present
in the true abilities, a trend might be noted in the estimation
errors. A substantial long-term trend in ability is unlikely to be
observed in Armed Services testing, however. Short-term trends
produced by a military draft situation are unlikely to affect more
than one or two generations of test items. Such a situation is similar
to the one simulated here.



Design for a Specific Application 



Following is an example of how the information learned about
linking techniques in the preceding sections could be applied to a
practical linking problem such as might be faced by the Armed
Services. The problem presented below is one developed, in cooperation
with Air Force personnel, to be representative of the linking problem
the Armed Services will encounter in the development of an item pool
for computerized adaptive administration of the ASVAB or its
successor. The problem described is presented only as a hypothetical
linking environment. The test described, while intended to reflect
expected conditions, is not based on specific studies and should
not be considered optimal, in any sense, for test design.

Description of the Problem

A new adaptive version of the ASVAB is to be developed. It will
contain 10 subtests, 8 of which will be power subtests. Only the
power subtests will require calibration by IRT methods. For each
of these eight subtests, a pool of approximately 200 items will be
developed. These items will be similar to items previously used in the
ASVAB, with the exception that they will be written to cover the
difficulty range from b = -2.5 to b = 2.5. The distribution of
difficulty is expected to be nearly rectangular with somewhat heavier
representation in the center.

Examinees for use in calibration will come primarily from all the 
AFEES. One additional hour of examining time to take experimental 
tests will be provided for 1,000 examinees at each of the AFEES. This 
means that roughly 50 new items, on the average, can be administered 
along with the current ASVAB. The eight item pools, in total, will 
contain 1,600 items. If 65,000 examinees each take 50 items and the 
1,600 items are equally apportioned, each item will be administered 
to 2,031 examinees.

Some of the new subtests will parallel subtests on the current
ASVAB; others will not. It is not essential that all individuals
within a given AFEES take the same test. It is essential that the
administration instructions and time requirements be identical for
all experimental tests given within a single AFEES.
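
The apportionment arithmetic in the problem description can be checked
with a few lines of code. The sketch assumes 65 AFEES (the number
cited elsewhere in this report for the ability samples) and exactly
200 items in each of the eight pools; the variable names are
illustrative only.

```python
# Figures taken from the problem description; 65 AFEES is an assumption
# consistent with the 65,000-examinee total given in the text.
n_afees = 65
examinees_per_afees = 1000
items_per_examinee = 50
n_pools = 8
items_per_pool = 200

total_examinees = n_afees * examinees_per_afees          # 65,000
total_items = n_pools * items_per_pool                   # 1,600
total_responses = total_examinees * items_per_examinee   # item administrations

# With items apportioned equally, each item is seen by this many examinees.
examinees_per_item = total_responses / total_items

print(total_examinees, total_items, examinees_per_item)
```

The result, 2,031.25, matches the figure of 2,031 examinees per item
given above.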

The objective of calibration and linking of these items is to
obtain eight item pools, each of which contains items which function
efficiently together for estimating ability. The actual scale on
which the items are linked is not critical but, if the new items
parallel an old ASVAB subtest, there should be a way of translating
the new test scores to the familiar ASVAB scores. Furthermore, there
should be some provision by which new items can be added to a pool
and linked to the original metric.

A Proposed Linking Design 

When applicable, the equivalent-groups method of linking
provides the most trouble-free and efficient linking available. It
appears that tests can be randomly distributed among AFEES if care
is taken and thus the equivalent-groups procedure is the method of
choice. The Bayesian scoring procedure is the preferred scoring
method.

Three major factors should be kept in mind when assembling the
experimental tests. First, administrative constraints require that
all tests use the same administration instructions and that each
requires no more than an hour to complete. Second, calibration
efficiency is enhanced with longer tests. Third, calibration of each
pool in equal-sized sets of items on equal numbers of examinees
results in greatest linking efficiency.

Prior to assembling the administration packets, rough time
estimates for completion of items in each of the pools should be
obtained either from pilot administration or from past experience.
Each pool should then be divided into the largest equal parts that
can be administered within the time allowed. No item overlap is
required.

Examinees can be apportioned across the eight pools equally
or unequally. If they are to be apportioned equally, the number
of examinees can be decided by simply dividing 65,000 by the number
of item subsets. It may be more appropriate, however, to apportion
unequally. The number of examinees apportioned to each subtest may
be decided by the relative importance of the pools, the relative
ease of calibration of the various item types, the number of subtests
within each of the item pools, or by other considerations. Samples
used within a pool should be of equal size; samples for different
pools do not need to be of equal size.

Experimental tests should be randomly distributed among AFEES
(and their mobile testing sites). While data presented in preceding
sections have suggested that the equivalent-groups procedure works
reasonably well even when tests are systematically distributed,
nonrandomness may result in the equivalent-groups method being less
efficient than one of the anchoring methods. If the items in a pool
parallel an ASVAB subtest which is routinely administered to all
examinees, the ASVAB items should be combined with each of the
individual experimental tests when calibration is done. If
distribution of test packets is done randomly, no explicit attempt at
anchoring need be done; the purpose of including the ASVAB items is
simply to increase calibration efficiency by increasing the test
length. If distribution is non-random, explicit anchoring may be
desirable.

Conceptually, expressing scores of the new tests in terms of the
old ASVAB scores may seem to be a simple matter of using the
appropriate ASVAB subtest as an anchor test and then anchoring new
items to it. Ability estimates from the new tests should, it seems, be
equivalent to ability estimates from the old. There are two reasons
why this is not the case. For finite-length tests, regardless of the
scoring procedure used, ability estimates will contain some error and
be biased to at least a small degree. Unless the ability estimates
from the ASVAB subtest and the new items have equivalent error and
bias, ability estimates of one will not be equivalent to the other,
even if linking is perfect. Secondly, the old ASVAB is not expressed
in an IRT ability metric. Obviously, then, ability estimates from
the old ASVAB will not be equivalent to ability estimates from the
new tests, even for infinitely long tests.

So even after the item pools are linked, correspondence be- 
tween the new adaptive ASVAB and the old conventional ASVAB will not 
be immediately available. These correspondences can be developed by 
conventional equating procedures but only after the item pools are 
incorporated into a testing strategy and its error properties are 
known. 

Addition of new items to the pool at a later time will require 
an anchor test. The most straightforward choice for such a test 
is a conventional test composed of items from the original ASVAB or 
the original new item set and kept constant in composition for all 
future additions. Research in a previous section suggested, however, 
that new anchor tests can be selected as time passes with slight 
efficiency loss and little bias. Use of the new ASVAB as an adaptive 
anchor test is another possibility. Further research into adaptive 
anchor tests should be done before such a method is applied, however.



VIII. SUMMARY AND CONCLUSIONS



Summary 



Previous Literature 

This report began with a review of the psychometric literature
relevant to linking and equating which resulted in a number of
findings. The first was a general framework for classification of
linking and equating designs. Linking methods were classified on two
general aspects: the design by which data are collected and the
algorithm by which the linking transformations are made. The data
collection designs were of four types: (a) sampling of equivalent
examinees (equivalent-groups method), (b) sampling of equivalent
items (equivalent-tests method), (c) anchoring through a common group
of examinees (anchor-group method), and (d) anchoring through a
common set of items (anchor-test method). There were a variety of
transformation algorithms which can be grouped into linear, nonlinear,
and Item-Response-Theory (IRT) methods.

Since the overall research project was limited to linking of
IRT-calibrated items, the review concentrated on IRT linking and
equating studies. The vast majority of the reported studies used
the Rasch IRT model. These tended to be more descriptive than
evaluative. The more evaluative studies suggested that Rasch equating
worked well for examinees of average or above average ability but
worked poorly when low-ability groups were equated to higher-ability
groups. This deficiency was probably due to the model's inability
to handle guessing.

Among the studies investigating linking using the more
appropriate three-parameter IRT models, there was some confusion
regarding the distinction between prediction, linking, and equating. A
distinction was made here by defining prediction as relating scores on
one psychological dimension to scores on another using regression
techniques, by defining equating as establishing a correspondence
between two tests measuring the same psychological dimension using
non-regression techniques, and by defining linking as putting
parameters of items measuring the same psychological dimension on the
same scale. Examples of research which inappropriately confounded
these techniques were discussed.

Linking Criteria 

The criteria used in past studies for evaluating the adequacy
of calibration, linking, and equating were not only confusing but,
typically, not useful for comparing various techniques. Two new
classes of criteria were developed for use in this project. The
first considered the asymptotic characteristics of ability estimates
using estimated item parameters. Through this class of criteria, the
biasing effects of calibration and linking errors could be assessed.
The second class of criteria consisted of the information and
relative efficiency of ability estimation resulting from the use of
item parameters containing calibration and linking errors. These
criteria were used to assess the relative test lengths required by the
various methods to produce equivalent precision of measurement. Techniques
for separating amounts of inefficiency due to calibration and to 
linking were presented. 
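
One way to read that separation, offered here as a sketch rather than
as the report's own derivation, follows directly from treating
relative efficiency as a ratio of average item information. The
symbols are introduced for illustration: \bar{I}_T, \bar{I}_E, and
\bar{I}_L denote average item information using true parameters,
estimated parameters with hypothetically perfect linking, and
estimated parameters as actually linked. The test-length statement
additionally assumes that test information grows in proportion to the
number of items.

```latex
% Overall efficiency factors into a calibration part and a linking part.
\[
\underbrace{\frac{\bar{I}_L}{\bar{I}_T}}_{\text{linked items vs. true}}
  \;=\;
\underbrace{\frac{\bar{I}_E}{\bar{I}_T}}_{\text{calibration}}
  \;\times\;
\underbrace{\frac{\bar{I}_L}{\bar{I}_E}}_{\text{linking}}
\]

% Under the proportionality assumption, a test of n_T items scored with
% true parameters is matched in information by a test of n_L items
% scored with the linked estimates, where
\[
\frac{n_L}{n_T} \;\approx\; \frac{\bar{I}_T}{\bar{I}_L} .
\]
```

With the selected-examinee figures reported earlier (.824 for
calibration and about .989 for linking), the product is about .815,
which under the stated assumption corresponds to a test roughly 1.23
times as long.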

Simulation Design 

Considering deficiencies in previous studies of linking, a simu- 
lation study to determine appropriate linking methods was designed. 
In developing the simulation model, care was taken to ensure that the 
test items specified were similar (in terms of their item parameters) 
to Armed Services items likely to be encountered in actual linking 
problems, and that the populations of simulated examinees were defined 
to be similar in ability to those likely to take such tests. 

Item parameters were specified after analysis of available data 
on current ASVAB forms. Included in these data were IRT item param- 
eters for an experimental ASVAB form paralleling Form 7 and conven- 
tional item parameters from norming administrations of new ASVAB Forms 
8, 9, and 10. The ability distributions were determined from samples 
of 500 examinees from each of 65 AFEES responding to ASVAB Form 7. 

The distributions of both ability levels and item parameters 
were generated from the mean, variance, skew, and kurtosis of the 
AFEES or ASVAB distributions using a random number generator capable 
of generating distributions of shapes specified by these four moments. 
Three basic data sets were created. The first, the randomly sampled 
data set, contained five replications at each of 12 combinations of 
test length and calibration sample size and simulated the condition in 
which test booklets were randomly distributed among the entire AFEES 
population. The second, the systematically sampled data set, contained 
the same combinations of test length and sample size but simulated the 
condition in which test booklets were distributed systematically among 
relatively few AFEES. The third, the selected data set, contained 
only one sample size and simulated the condition in which a selected 
recruit population was used. 

Three categories of evaluative criteria were used to assess the
adequacy of calibration and linking. The first category, fidelity
of estimation, examined the relationships between true and estimated
item parameters. Statistical indices used included the familiar
bias, absolute error, root-mean-square error, and correlation used
in previous studies. The second category, characteristics of
asymptotic ability estimates, examined the relationships between true
and asymptotic (i.e., infinite-test-length) ability estimates.
Statistical indices included the mean, standard deviation, absolute
and root-mean-square error of the estimates, and the correlation
between true and asymptotic ability. The last category, efficiency of
ability estimation, included average item information (an index
closely related to the precision of estimation) and relative
efficiency, the ratio of information from two sources. In this study,
efficiencies were computed relative to the true and estimated item
parameters,
yielding efficiency indices of the linked items and linking proce- 
dure, respectively. 
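
The fidelity-of-estimation indices listed above (bias, absolute error, root-mean-square error, and correlation) can be sketched as follows. The function name `fidelity_indices` and the sample values are hypothetical, and the sign convention (estimate minus true value) is an assumption.

```python
import math


def fidelity_indices(true_vals, est_vals):
    """Compare estimated parameters with their known true values."""
    n = len(true_vals)
    # Assumption: error is defined as estimate minus true value.
    errors = [e - t for t, e in zip(true_vals, est_vals)]
    bias = sum(errors) / n
    abs_error = sum(abs(d) for d in errors) / n
    rmse = math.sqrt(sum(d * d for d in errors) / n)
    mean_t = sum(true_vals) / n
    mean_e = sum(est_vals) / n
    cov = sum((t - mean_t) * (e - mean_e)
              for t, e in zip(true_vals, est_vals))
    ss_t = sum((t - mean_t) ** 2 for t in true_vals)
    ss_e = sum((e - mean_e) ** 2 for e in est_vals)
    corr = cov / math.sqrt(ss_t * ss_e)  # Pearson correlation
    return {"bias": bias, "abs_error": abs_error,
            "rmse": rmse, "corr": corr}


# Hypothetical true and estimated b parameters for three items.
result = fidelity_indices([-1.0, 0.0, 1.0], [-0.8, 0.1, 1.3])
```

In a simulation the true values are known by construction, which is what makes these indices computable.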



Results 

In evaluating the basic data sets, three general conclusions 
were reached. First, the parameter correlation data generally sup- 
ported other studies which assessed the calibration effectiveness of 
OGIVIA, the calibration program used in this study. The b parameters 
were very well estimated and the a and c parameters were less well 
estimated. Second, test length appeared to be relatively more important
to calibration effectiveness than was sample size; efficiency
analyses suggested that increases in test length were at least three 
to four times as effective in improving calibration efficiency as pro- 
portionate increases in calibration sample sizes. Finally, there was 
little difference in calibration efficiency between randomly and sys- 
tematically sampled examinees, but there was a large difference in ef- 
ficiency between these and the selected examinee groups. 

In the randomly sampled data set, two general linking methods,
the equivalent-groups and the equivalent-tests methods, were evaluated
and compared. Comparisons were done both in a homogeneous linking
condition, where the items to be linked were calibrated in tests
of equal length using examinee samples of equal size, and in a
heterogeneous condition of mixed test lengths and examinee sample sizes.

The fidelity-of-parameter-estimation analyses were unable to 
provide any conclusive evidence regarding which linking procedure 
was most effective. The asymptotic ability analyses, however, sug- 
gested that two linking procedures based on Bayesian ability estima- 
tion (an equivalent-groups procedure) were somewhat more effective 
than the others and that the equivalent-tests method was typically 
no better than not linking at all. The third set of analyses, those 
using the relative efficiency criteria, suggested that the equivalent- 
groups procedures were superior to the equivalent-tests procedures 
and that those using Bayesian scoring procedures were slightly superior 
to the others tested. Relatively little efficiency was lost when the
OGIVIA-produced parameters were used with no explicit linking. Effi- 
ciency loss due to linking error was always less than that due to 
calibration error and, although test length and sample size had a 
definite effect on calibration efficiency, no strong effects appeared 
with respect to linking efficiency.

In the systematically sampled data set, two additional linking
methods were considered along with the equivalence methods. The
anchor-group method linked item sets using common examinee groups
of different sizes and compositions. The anchor-test method linked
item sets using common tests of different sizes and compositions. In
terms of linking efficiency, the anchor-test method produced the most
efficient item pools. The anchor-group method resulted in efficiencies
equivalent to those of the anchor-test procedure if large groups
were used, but with smaller groups the efficiencies dropped somewhat.
The equivalence methods were somewhat less efficient than either of
the anchor methods. Bayesian scoring was the method of choice.
Maximum likelihood appeared not to be a useful scoring procedure
for the linking conditions investigated.
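
To make the idea of anchor-test linking concrete, the sketch below places a newly calibrated item on a base scale through a set of common (anchor) items. This summary does not state which transformation the report used. The mean-sigma linear transformation shown here is a common textbook approach swapped in purely for illustration, and all names and values are hypothetical.

```python
import statistics


def mean_sigma_constants(b_base, b_new):
    """Slope A and intercept B mapping the new scale onto the base scale.

    Both lists hold difficulty (b) estimates of the SAME anchor items,
    one list per calibration. Mean-sigma linking is an assumption here,
    not necessarily the procedure used in the report.
    """
    A = statistics.pstdev(b_base) / statistics.pstdev(b_new)
    B = statistics.mean(b_base) - A * statistics.mean(b_new)
    return A, B


def link_item(a, b, c, A, B):
    """Re-express one item's (a, b, c) on the base scale; c is unchanged."""
    return a / A, A * b + B, c


# Hypothetical anchor-item difficulties from two separate calibrations.
b_base = [-1.0, 0.0, 1.0]
b_new = [-0.75, -0.25, 0.25]
A, B = mean_sigma_constants(b_base, b_new)
linked = link_item(1.0, 0.0, 0.2, A, B)
```

Under this approach, a larger or better-targeted anchor test gives more stable estimates of A and B, which is consistent with the comparison of anchor tests of different sizes and compositions described above.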

Results from analyses based on data from linking when examinees 
were selected tended to parallel those of the randomly sampled data 
set. The equivalent-groups and no-linking methods produced item 
pools as efficient as the more complex anchoring methods. These 
methods were ineffective in recovering the original metric, however. 
Mean asymptotic estimates were biased downward considerably from the 
true values, and standard deviations were larger than the true values. 
One of the more complex methods would have to be used if recovery of
the original metric was desired.

Application to a Practical Linking Problem 

An application of the results of this research to a practical
linking problem was described. The problem consisted of calibration
and linking of item pools for computerized adaptive administration
of the ASVAB. The general suggestion was that experimental test
booklets be randomly distributed and equivalent-groups linking be
used. For addition of items at later times, an anchor-test linking
method was suggested. A further simulation was done to investigate
the effect of cascaded anchor tests in which a new anchor test was
created for each link. Neither excessive drift nor loss in efficiency
was noted. It was concluded that such cascading could be used if
necessary but that a constant anchor test should be preferred. When
maximum-likelihood and Bayesian scoring procedures were compared in
the cascaded condition, the maximum-likelihood procedure showed a
slight efficiency advantage over the Bayesian procedure.



Conclusions 



If the item-linking procedures suggested in this report are
followed, parameter errors due to imperfect linking should be a
relatively minor problem in the development of an adaptive-testing item
pool. With proper procedures the efficiency loss due to linking
errors should be approximately 1%. This is small in comparison to
the 10% to 12% efficiency loss due to calibration errors. This study
thus appears to have answered the question: How should different
item sets calibrated in different examinee groups be linked?

Next to the findings regarding linking, perhaps the most impor- 
tant results of this project were the developments of new classes of 
criteria of calibration and linking adequacy. It is conceivable 
that calibration, noted to be a greater problem than linking, might 
be improved by using a different calibration program. Prior to this 
study, no adequate method of comparing calibration effectiveness of 
various calibration programs and algorithms had been available. The 
efficiency statistics presented here allow a direct comparison of var- 
ious procedures in terms of their capacity to provide parameters con- 
ducive to accurate estimation of ability. Since ability estimation 
is the objective of ability testing, these criteria seem ideal.

Analyses of the basic data sets using the program OGIVIA were 
presented in sufficient detail that they could easily be replicated 
using other calibration techniques. Evaluation of other calibration 
techniques using the efficiency criteria should quickly answer the 
question of which procedure is best. Since efficiency has a direct 
translation into test length, it should be useful in a cost-benefit
analysis of the various procedures if the best procedure also should 
turn out to be the most expensive. 
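
The link between efficiency and test length can be illustrated with a short sketch. It assumes a three-parameter logistic model with scaling constant D = 1.7 as a stand-in for the normal ogive. It also assumes that information adds across items, so that a relative efficiency of e implies a test about 1/e times as long for equal precision. All function names and values are hypothetical.

```python
import math

D = 1.7  # assumed logistic scaling constant approximating the normal ogive


def p3pl(theta, a, b, c):
    """Three-parameter logistic probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def item_information(theta, a, b, c):
    """Fisher information of one 3PL item at ability theta."""
    p = p3pl(theta, a, b, c)
    q = 1.0 - p
    return (D * a) ** 2 * (q / p) * ((p - c) / (1.0 - c)) ** 2


def relative_efficiency(info_one, info_two):
    """Ratio of information from two sources."""
    return info_one / info_two


def equivalent_length(n_items, efficiency):
    """Items needed to match an n_items test that has efficiency 1.0."""
    return n_items / efficiency


info = item_information(0.0, 1.0, 0.0, 0.0)  # hypothetical item
needed = equivalent_length(30, 0.90)         # hypothetical 30-item test
```

Under these assumptions, an efficiency of 0.90 means that about 33 items are needed to match a 30-item test whose parameters are known exactly, which is the kind of figure a cost-benefit analysis could weigh against calibration cost.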

The asymptotic-estimate criteria should have application in 
evaluating various equating methods. In this study, these criteria 
showed that, using estimates of the item parameters, the relationship 
between true and asymptotic ability was not perfectly linear. In 
populations such as those considered here, this did not appear to be 
a great problem. This nonlinearity may be a problem in the vertical 
equating of tests of widely different difficulty levels. It was not 
uncommon for tests investigated in this project to fail to yield
ability estimates much below -2.0. If two tests were substantially
different in difficulty and the parameters were less-than-perfect
estimates, the relationship between the two tests might be nonlinear. This
is an area that should be investigated before IRT vertically equated 
tests are used for real decisions. 

As a third area for application of the new criteria, efficiency 
analysis might be applied to investigating the appropriate number of 
parameters in an IRT model. Rasch enthusiasts, and some others, have
suggested that the Rasch model is the appropriate method to use
because other parameters in the multi-parameter models are too difficult
to estimate. Using efficiency analysis, it should be possible to de- 
termine how many examinees and items are required for the additional 
parameters in a two- or three-parameter model to produce a net gain in 
measurement efficiency.

In summary, it is likely that there will be few questions concerning
the development of Armed Services adaptive testing pools that
cannot be answered from data presented in this report. Calibration 
presents somewhat more of a problem than does linking, but further 
research using criteria developed here should help solve this prob- 
lem. Finally, developments resulting from this project may aid in 
the solution of some other IRT-related psychometric problems. 

APPENDIX A--SUPPORTING TABLES



Table A-1. Characteristics of the ASVAB 
General Science Subtest by AFEES 



AFEES N Mean SD Skew Kurtosis

1 500 .2759 .9975 -2559 -9677 
3 500 -2700 .9629 .2230 -6493 

5 500 .0424 .9233 -2850 -4963 

6 499 .1316 1 .0036 -3345 -5874 

7 500 .1577 .9717 -3273 -5866 
s 500 - 1 189 .9899 -0409 -.6064 
g 497 - 1391 .9960 -043* -7140 

10 500 .0586 .9589 -1956 -6268 

2 500 . 1587 .9123 -2064 -7096 
13 498 -0388 .9763 -1974 -«2*3 
,J 4 9 « .3436 .8849 --4363 -.3761 
j 5 5 oo -.3154 1 .0679 .0466 -7725 
16 500 .0173 1 -0550 -1409 -8760 
ifl 498 -.3935 1.0101 .0824 -.6752 

J 498 .0021 .9756 -0912 -8322 

20 497 .4389 .8544 -5075 -3148 

22 5C0 -.2880 .9980 .1660 -7573 

24 500 . 1239 .9449 -2193 -6742 

25 499 .3173 .9534 -5289 -4252 

26 500 .2643 .931 1 -3749 -4579 

27 498 -.5292 .9194 .3814 

?ft 499 -4400 .9658 .4163 -6887 

29 499 - 1850 .9564 -0341 -817, 

11 498 -2212 1 .0073 - 1015 -7309 
\\ 500 -.4460 .9945 .2912 -.6558 

32 500 -6476 .8614 .4003 -'635 

33 500 -2171 1.0002 .0805 -7691 

34 499 -0318 .9542 - 1562 -6444 

35 499 -5602 .9253 -42H -3806 
11 "98 -4483 .9480 . 1514 -4097 

37 4 -0875 .9508 - 1301 -638 

38 499 -4957 .9286 .2750 -372 
ft 500 .0943 .9005 - 1907 -6 1 

42 499 -0197 .9267 -0553 -5823 

43 499 - 1200 1 .0224 -0o94 -7847 
uj 499 -.0471 -8941 .0706 -.6153 

00 -.1833 .9828 .0308 -757 

46 500 -2542 1 .0306 .0859 -8044 

47 500 -4734 .9692 .2842 -4526 
H 499 .0146 .9763 -0841 -7965 
Table A-1. Characteristics of the ASVAB
General Science Subtest by AFEES (Continued)

AFEES    N      Mean       SD         Skew      Kurtosis

49       498    [illeg.]   [illeg.]   -.0421    -.4645
50       500    [illeg.]   [illeg.]    .2451    -.80?4
51       495    [illeg.]   [illeg.]    .1777    -.7167
52       500    [illeg.]   [illeg.]   -.4302    -.3879
53       499    [illeg.]   [illeg.]   -.4255    -.4789
54       498    [illeg.]   [illeg.]   -.2421     .5689
55       500    [illeg.]   [illeg.]   -.5263    -.3761
56       499    [illeg.]   [illeg.]   -.3?00    -.4164
57       500    [illeg.]   [illeg.]   -.0894    -.8000
58       500    [illeg.]   [illeg.]   -.3811    -.6244
59       500    [illeg.]   [illeg.]   -.5014    -.4554
60       499    [illeg.]   [illeg.]    .1803    -.7032
61       499    [illeg.]   [illeg.]   -.6893     .3777
62       498    -.0607     [illeg.]   -.1127    -.8639
63       497     .3890      .9301     -.3154    -.7?74
64       500     .4154      .9066     -.4386    -.5772
65       500     .3866      .9587     -.4136    -.6086
66       500     .0442      .9446     -.0944    -.6210
67       500    -.0438      .9523     -.0587    -.6479
68       500     .1077      .9942     -.2687    -.8586
69       497     .2357      .9770     -.2619    -.8596
70       500     .4520      .8901     -.5993     .5?96
71       499     .2950      .9245     -.2888    -.6018
72       500     .4413      .9064     -.6921     .2333
76       498     .?95?      .9114     -.4368    -.3101

Note. [illeg.] marks an entry that is not legible in the source; ? marks
an illegible digit. Signs are given as they appear in the source.
Table A-2. Items Selected for Inclusion in the
Normal, Rectangular, and Peaked Anchor Tests



Anchor Test 



Normal 



Rectangular 



True Item Parameters    Estimated Item Parameters
a  b  c    â  b̂  ĉ



2.2766 .0338 
1 .8243 -1.8344 
1.7780 1.9989 
1.8098 .423') 
3.8753 -.7242 
2.5663 -.3764 
1.9929 .3155 
1.5909 1.0338 
2.5162 -1.1096 
2.1169 -.5406 
2.6324 .6080 
2.3331 .7268 
2.1136 -1.2472 
2.2304 -1.6778 
2.2070 1.3933 



1.8899 
1.8079 
1 .5047 
1 . 8009 
1 .4296 
1 .7189 



-.0312 
-.3500 
-.5989 
.2759 
.7051 
-.9806 



1.8392 -1.5184 
1.6760 1.4379 
1.7338 -.1039 
1.3747 -.4329 

2.2T66 .0338 

1.8243 -1.8344 

2.3086 2.1240 

2.0131 .9706 

2.5162 -1.1096 

3.8753 -.7242 

1.9098 .4236 

2.2070 1.3933 

2.2304 -1.6773 

2.1136 -1.2472 

2.6324 .6050 
1.6750 1.4379 
1.8"»92 -1.5184 



. mo 1 


2.2717 


.0078 


.1059 


T763 


1.4526 - 


•2.3105 


.1748 


1891 


3.0000 


1.7863 


.0955 


1 170 


2.2358 


.4736 


.1079 




3.0000 


-.7405 


.1901 


1719 


2.3082 


-.4020 


.0924 




1 .9821 


.3446 


.1689 


1 10? 


1 .7310 


1.1774 


.1342 




2.1824 • 


-1 .1509 


.0059 


• C H C 


1 .6920 


-.6036 


.1106 


3 1711 


? H24 


.6768 


.2907 


.3429 


1 .9484 


.7717 


.3210 


.1364 


1.8686 


-1 .2710 


.0643 


mm 


2.0307 


-1 .5930 


. 1640 


7H67 

• juQ f 


3 0000 


1 .4893 


.2275 


1 on? 


1 .6845 


-.0108 


.1378 


?AQ5 


1 .7847 


-.2940 


.2531 




1 6149 


-.4958 


.1126 


?3?? 


1 659 7 


.3591 


.2240 


• c.c.O\j 


1 69?9 


.3457 


.2637 




1 .6022 


-1 .0177 


.2227 


1 10s 


1 .7279 


-1 .4533 


.0377 


1101 


1 .9381 


1.5048 


.3151 


P1 


1 . 4660 


-.1524 


.1183 




1 . 3737 


-.3864 


.1557 


.1401 


2.2717 


.0773 


. 1059 


.3763 


1.4526 


-2.3105 


.1746 


.1439 


2.4515 


2. 1056 


.1259 


.1966 


2.6381 


.9975 


. 1558 


.1104 


2.1324 


-1.1509 


.0059 


.2951 


3.0000 


-.7405 


.1901 


.1170 


2.2358 


.4736 


.1079 


.3067 


3.0000 


1.4893 


.2275 


.1435 


2.0307 


-1.5930 


. 1640 


.1364 


1 . 3666 


-1.2710 


.0643 


.3174 


2.3324 


.6768 


.2907 


.3101 


1.9381 


1.5048 


.3151 


.1105 


1.7279 


-1.4533 


.0377 
Table A-2. Items Selected for Inclusion in the
Normal, Rectangular, and Peaked Anchor Tests (Continued)



True Item Parameters 



Estimated Item Parameters 



Anchor Test 



Rectangular 
(Cont.) 



Peaked 



a 




b 


c 


a 


6 


c 


1 .4949 


-1 


.9274 


. 1493 


1 2598 


-2 1565 


0Q77 


1.3346 


2 


.3002 


.1202 


2.1999 


2.3542 


.1633 


1.9929 




.3155 


. 1834 


1 .9821 


.3446 


.1689 


2.5663 




.3764 


.1719 


2.3o32 


-.4020 


.0924 


1.8353 




.7625 


. 1751 


1.5589 


-.8333 


.0606 


2.3331 




.7268 


.3429 


1 .9484 


.7717 


.3210 


1.5909 


1 


.0338 


.1102 


1 .7310 


1. 1774 


.1342 


1.7525 


-1 


.8702 


.2204 


1 .6999 


-1 .7462 


.2693 


1.3909 


-1 


.8031 


.1144 


1 .3265 


-1.8646 


.0699 


1.3883 


1 


.8744 


. 1674 


1 .9353 


1.9720 


. 1973 


1.8009 




.2759 


.2322 


1 .6597 


.3591 


.2240 


1.5617 


- 


.4916 


. 1561 


1.7318 


-.3962 


. 1286 


2.2755 




.0338 


. 1401 


2.2717 


.0778 


.1059 


2.5241 


- 


. 1973 


.2941 


2.2957 


-.1850 


.2327 


2.5663 


- 


.3764 


. 1719 


2.3082 


-.4020 , 


.0924 


2. 1322 


- 


.2409 


.1218 


1 .8271 


-.2715 


.0364 


1.9838 




.0308 


. 1765 


1 .8246 


.0432 


. 1243 


2. 1322 




. 1437 


. 1296 


1 .7626 


.1053 


.0733 


2.5678 


- 


.0124 


.2990 


2.0325 


-.0081 


.2535 


1 .7472 


- 


.2444 


.1108 


1 .7626 


-.1665 


. 1060 


1 .8899 


- 


.0312 


. 1902 


1 .6845 


-.0108 


.1373 


1 .8609 


- 


.4670 


.1111 


1.7860 


-.4194 


.0573 


2.1462 


- 


.3844 


. 1625 


1 .8270 


-.4245 


.0751 


2.8007 


- 


.4404 


.3155 


2.4904 


-.4772 


.2332 


2.2596 


- 


.0840 


.??09 


1 .5028 


-.1956 


.0838 


1.5617 




.4916 


. 1561 


1 .7318 


-.3962 


. 1286 


1.8079 




.3500 


.2895 


1 .7847 


-.2940 


.2531 


2. 1945 




.4153 


.2529 


1 .7370 


-.4213 


.1749 


1.7838 




.1039 


.2143 


1 .4560 


-.1524 


.1183 


2.1038 




.2952 


. 3263 


1 .6991 


-.3497 


.2141 


1.4159 




. 1788 


. 1443 


1 .4204 


-.1162 


.1102 


1.5732 




.2968 


.2128 


1 .5095 


-.2477 


.1697 


1.3253 




.1994 


.2239 


1 .4196 


-.3236 


.0758 


1.9929 




.3155 


. 1834 


1 .9821 


.3446 


. 1689 


1.7777 




.3484 


2414 


1 .5750 


-.3496 


.1905 


2.2933 




.2287 


. 3771 


1 .6622 


-.2980 


.2831 


2.2819 




.3800 


. 3237 


1 .7573 


-.4519 


.2241 
APPENDIX B--REVISIONS TO PROGRAM OGIVIA



The item calibration program, OGIVIA, was obtained from James 
McBride of the Navy Personnel Research and Development Center in San 
Diego. The version received was written by Jerry Edwards of the 
University of Washington and had been revised and updated by John F. 
Gugel of the U.S. Civil Service Commission. A review of the program 
revealed several problems. Their possible impact and the corrections 
made are detailed below. 

A variant of the test information value was originally used for 
the scaling factor in the Newton-Raphson ability estimation routine. 
This factor was replaced with the second derivative of the log of the 
Bayesian posterior density function. In theory, this substitution 
should have made little difference in the ability and parameter esti- 
mates obtained. In fact, differences in the second and third decimal 
place were occasionally observed. This was assumed to be due to the 
fact that the criterion for termination of the iteration was a change 
in the absolute value of the estimate of less than 0.005 and that when 
the original scale factor was used, there was no assurance that the 
estimate was within 0.005 of the final value at this point. The dif- 
ferences were thus attributed to increased accuracy of estimate ob- 
tained with the modification. It was also noted that changing to the 
second derivative resulted in an average 20% decrease in the computer
time required to estimate ability. 

Another inefficiency was noted in the Newton-Raphson procedure. 
It appeared that this procedure, by itself, was not always successful 
in locating the modal Bayesian ability estimate. In some cases, the
Bayesian posterior density function can be of a sufficiently irregu- 
lar shape that a starting value very near the final estimate is re- 
quired for convergence. The original program discarded examinees 
whenever the ability estimate failed to converge in 20 iterations. 
To preclude such examinee loss, the original algorithm was augmented 
by adding a bisection routine. The bisection was invoked whenever 
the Newton-Raphson procedure failed to converge within seven itera- 
tions. Following the bisection procedure, providing that a root 
existed in the interval -8.0 < θ < 3.0 (a virtual certainty), the
Newton-Raphson procedure was called again to refine the estimate and 
was allowed to iterate up to eight times. 
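
The augmented procedure described above (Newton-Raphson, then bisection on failure, then Newton-Raphson again) can be sketched as follows. This is not the OGIVIA code. It assumes a three-parameter logistic model with D = 1.7, a standard normal prior, and a bisection stopping width of 0.1, none of which is stated in this appendix. The iteration limits (7 and 8), the 0.005 criterion, and the bracket as printed (-8.0 to 3.0) are taken from the text. All function names are hypothetical.

```python
import math

D = 1.7  # assumed logistic scaling constant


def p3pl(theta, a, b, c):
    """Three-parameter logistic probability of a correct response."""
    z = max(-35.0, min(35.0, D * a * (theta - b)))  # overflow guard
    return c + (1.0 - c) / (1.0 + math.exp(-z))


def d1(theta, items, resp):
    """First derivative of the log posterior, assuming a N(0, 1) prior."""
    total = -theta
    for (a, b, c), u in zip(items, resp):
        p = p3pl(theta, a, b, c)
        total += D * a * (u - p) * (p - c) / (p * (1.0 - c))
    return total


def d2(theta, items, resp):
    """Second derivative of the log posterior (Newton scaling factor)."""
    total = -1.0
    for (a, b, c), u in zip(items, resp):
        p = p3pl(theta, a, b, c)
        total += ((D * a) ** 2 * (p - c) * (1.0 - p)
                  * (u * c / p ** 2 - 1.0) / (1.0 - c) ** 2)
    return total


def newton(theta, items, resp, max_iter, tol=0.005):
    """Newton-Raphson; converged when the change in theta is below tol."""
    for _ in range(max_iter):
        h = d2(theta, items, resp)
        if h >= 0.0:  # not locally concave: treat as non-convergence
            return theta, False
        step = d1(theta, items, resp) / h
        theta -= step
        if abs(step) < tol:
            return theta, True
    return theta, False


def bisect(items, resp, lo, hi, width=0.1):
    """Narrow a sign change of d1 to an interval of the given width."""
    f_lo = d1(lo, items, resp)
    if f_lo * d1(hi, items, resp) > 0.0:
        return None  # no root bracketed in [lo, hi]
    while hi - lo > width:
        mid = 0.5 * (lo + hi)
        f_mid = d1(mid, items, resp)
        if f_lo * f_mid <= 0.0:
            hi = mid
        else:
            lo, f_lo = mid, f_mid
    return 0.5 * (lo + hi)


def modal_estimate(items, resp, start=0.0, lo=-8.0, hi=3.0):
    """Bayesian modal estimate: Newton, bisection fallback, Newton again."""
    theta, ok = newton(start, items, resp, max_iter=7)
    if ok:
        return theta
    theta = bisect(items, resp, lo, hi)
    if theta is None:
        return None
    theta, ok = newton(theta, items, resp, max_iter=8)
    return theta if ok else None


# Hypothetical three-item test (a, b, c) and response pattern.
items = [(1.0, -1.0, 0.2), (1.2, 0.0, 0.2), (0.8, 1.0, 0.2)]
resp = [1, 1, 0]
estimate = modal_estimate(items, resp)
```

Returning `None` rather than discarding the examinee silently leaves the caller to decide what to do in the rare case where no root is bracketed.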

A final problem was encountered when OGIVIA discarded items 
whose parameter estimates exceeded pre-established bounds. While in 
practical calibration applications this may be an acceptable solution, 
in the present research design it presented a serious biasing effect 
on the comparisons of different cells in the design. To alleviate 
this problem, items whose parameter values would have caused them to 
be discarded were arbitrarily bounded as follows: 
0.5 < a < 3.0,
-3.0 < b < 3.0,
0.0 < c < 0.5.

Although somewhat arbitrary, these values appear to reflect
reasonable ranges for the parameters and seemed preferable to loss of
the item. These item parameters were bounded on both the first and
second stages of the OGIVIA program.
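
The bounding rule can be sketched as a simple clamp. The names `BOUNDS` and `bound_item` are hypothetical. Estimates that fall outside a range are set to the nearest limit, which is an assumption but matches the estimated a values of 3.0000 that appear in Table A-2.

```python
# Limits taken from the ranges listed above; names are hypothetical.
BOUNDS = {"a": (0.5, 3.0), "b": (-3.0, 3.0), "c": (0.0, 0.5)}


def clamp(value, lo, hi):
    """Return value limited to the closed interval [lo, hi]."""
    return max(lo, min(hi, value))


def bound_item(a, b, c):
    """Bound an item's parameter estimates instead of discarding the item."""
    return (clamp(a, *BOUNDS["a"]),
            clamp(b, *BOUNDS["b"]),
            clamp(c, *BOUNDS["c"]))


bounded = bound_item(3.7, -4.2, 0.6)    # hypothetical out-of-range item
unchanged = bound_item(1.0, 0.0, 0.2)   # hypothetical in-range item
```

Keeping every item in the pool this way means that all cells of the design are compared on the same set of items.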